
Variable length columns for CTable#624

Open
FrancescAlted wants to merge 10 commits into main from ctable-varlen-cols

Conversation

@FrancescAlted
Member

This adds support for CTable columns that can host variable-length entities (typically large objects or lists of smaller objects).

It introduces a new ListArray object that centralizes variable-length handling.
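To make the idea concrete, here is a minimal plain-Python sketch of the classic offsets-plus-flat-values layout used for variable-length list columns (the same layout Arrow's ListArray uses). This is an illustration only; the class and method names below are invented for the example, and the actual blosc2.ListArray in this PR layers the idea over BatchArray/VLArray with compression and persistence.

```python
class TinyListColumn:
    """Illustrative list column: one flat values buffer plus an offsets array."""

    def __init__(self):
        self.values = []    # flattened contents of all cells
        self.offsets = [0]  # values[offsets[i]:offsets[i+1]] is cell i

    def append(self, cell):
        # Treat None as an empty list, mirroring typical list-column semantics.
        cell = [] if cell is None else list(cell)
        self.values.extend(cell)
        self.offsets.append(len(self.values))

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, i):
        return self.values[self.offsets[i]:self.offsets[i + 1]]


col = TinyListColumn()
col.append([1, 2])
col.append(None)
col.append([3, 4, 5])
# col[0] -> [1, 2]; col[1] -> []; col[2] -> [3, 4, 5]
```

The appeal of this layout is that cells of wildly different lengths share one contiguous buffer, so random access is two offset lookups plus a slice.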


Copilot AI left a comment


Pull request overview

Adds variable-length list-valued columns to CTable via a new ListSpec schema type and a new ListArray container that abstracts variable-length storage over BatchArray/VLArray, including persistence and Arrow round-tripping.

Changes:

  • Introduces ListSpec + blosc2.list(...) schema API and schema (de)serialization support for list columns.
  • Adds blosc2.ListArray and integrates it into CTable storage, mutation, selection, persistence, and Arrow import/export.
  • Adds SChunk.reorder_offsets() plus tests; updates docs/examples/benchmarks to reflect new capabilities and performance tooling.
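As a conceptual aside on reorder_offsets(): a super-chunk keeps an index of per-chunk offsets, so permuting that index reorders the logical chunk sequence without rewriting the chunk data itself. The sketch below shows only the permutation semantics in plain Python; `apply_reorder` is a hypothetical name for illustration, not the blosc2 API.

```python
def apply_reorder(chunks, new_order):
    """Return chunks in the logical order given by the permutation new_order.

    Only the index is permuted; the chunk payloads are not copied or rewritten.
    """
    if sorted(new_order) != list(range(len(chunks))):
        raise ValueError("new_order must be a permutation of the chunk indices")
    return [chunks[i] for i in new_order]


# apply_reorder([b"a", b"b", b"c"], [2, 0, 1]) -> [b"c", b"a", b"b"]
```

In the real SChunk, the cheap-permutation property is what makes offset reordering useful for operations like sorting, where rows can be regrouped without recompressing every chunk.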

Reviewed changes

Copilot reviewed 44 out of 44 changed files in this pull request and generated 4 comments.

File Description
tests/test_schunk_reorder_offsets.py New tests for SChunk.reorder_offsets() correctness and error handling.
tests/test_list_array.py New tests for ListArray append/extend/indexing/persistence/Arrow roundtrip.
tests/ctable/test_varlen_schema_compiler.py Tests for list spec building, schema compilation, and schema dict roundtrip.
tests/ctable/test_varlen_columns.py New integration tests for list columns in CTable (append/extend/where/select/compact/persistence/Arrow).
tests/ctable/test_sort_by.py Updates view sorting behavior tests (inplace vs copy).
src/blosc2/schunk.py Adds Python-level SChunk.reorder_offsets() and open-dispatch for ListArray.
src/blosc2/schema_vectorized.py Extends batch validation to validate list cells via coerce_list_cell.
src/blosc2/schema_validation.py Extends row/rows validation to normalize and type list columns for Pydantic validation.
src/blosc2/schema_compiler.py Adds ListSpec awareness (dtype optional), list annotation validation, and schema (de)serialization for list specs.
src/blosc2/schema.py Introduces ListSpec and public blosc2.list(...) builder.
src/blosc2/list_array.py New ListArray implementation over BatchArray/VLArray, including Arrow support and metadata tagging.
src/blosc2/dict_store.py Adds support for persisting/discovering external ListArray leaves as .b2b.
src/blosc2/ctable_storage.py Extends table storage to create/open list columns and improves index sidecar path handling for .b2z.
src/blosc2/ctable.py Integrates list columns into CTable core operations (append/extend/select/compact/save/load/to_arrow/sort/copy/info).
src/blosc2/core.py Adds from_cframe() dispatch for ListArray.
src/blosc2/blosc2_ext.pyx Adds Cython binding for SChunk.reorder_offsets().
src/blosc2/init.py Exposes ListArray and list builder in the public API.
plans/ctable-varlen-cols.md Detailed design/implementation plan for variable-length columns.
examples/ctable/varlen_columns.py Example demonstrating list columns and ListArray usage.
examples/ctable/index_on_b2z.py Example demonstrating index persistence across .b2z roundtrip.
doc/reference/list_array.rst New reference docs for ListArray.
doc/reference/ctable.rst Updates CTable docs to mention list columns and blosc2.list.
doc/reference/classes.rst Adds ListArray to the documented class/module lists.
bench/ctable/where_selective.py Uses perf_counter for timing.
bench/ctable/where_chain.py Uses perf_counter; replaces unsupported boolean expression usage in DSL.
bench/ctable/varlen.py New benchmark for varlen list columns across backends and access patterns.
bench/ctable/speed_iter.py Reworks row-iteration benchmark with sampling and perf_counter.
bench/ctable/sort_by.py New benchmark for sort_by() performance across scenarios.
bench/ctable/slice_to_array.py Updates benchmark to use slicing directly (no .to_numpy()).
bench/ctable/slice_steps.py Updates benchmark to use slicing directly (no .to_numpy()).
bench/ctable/slice.py Uses perf_counter for timing.
bench/ctable/row_access.py Uses perf_counter for timing.
bench/ctable/print.py Reworks benchmark to compare ingestion + rendering cost with pandas using perf_counter.
bench/ctable/iteration_column.py Updates benchmark to use slicing directly (no .to_numpy()).
bench/ctable/iter_rows.py Reworks iteration benchmark cases and uses perf_counter.
bench/ctable/indexin.py New benchmark comparing index kinds vs scan across selectivities and data layouts.
bench/ctable/indexin.md Captured benchmark output for index kinds comparison.
bench/ctable/extend_vs_append.py Reworks benchmark comparing append vs extend strategies with perf_counter.
bench/ctable/extend.py Uses perf_counter for timing.
bench/ctable/expected_size.py Uses perf_counter for timing.
bench/ctable/delete.py Uses perf_counter for timing.
bench/ctable/ctable_v_pandas.py Updates benchmark to use slicing directly (no .to_numpy()).
bench/ctable/compact.py Uses perf_counter for timing.
bench/ctable/bench_persistency.py Updates benchmark to use slicing directly (no .to_numpy()).


Comment thread src/blosc2/list_array.py
Comment on lines +289 to +294
    def extend(self, values: Iterable[Any], *, validate: bool = True) -> None:
        if validate:
            cells = [coerce_list_cell(self.spec, v) for v in values]
        else:
            cells = [v if v is not None else [] for v in values]
        if self.spec.storage == "vl":
Comment thread src/blosc2/ctable.py
Comment on lines 1408 to +1411
    def close(self) -> None:
        """Close any persistent backing store held by this table."""
        with contextlib.suppress(Exception):
            self._flush_varlen_columns()
Comment thread src/blosc2/ctable.py
Comment on lines +4806 to +4810
        if isinstance(data, dict):
            provided_names = set(data) & set(current_col_names)
            new_nrows = len(next(iter(data.values())))
            raw_columns = {name: data[name] for name in provided_names}
        elif isinstance(data, np.ndarray) and data.dtype.names is not None:
Comment on lines +224 to +236
def test_sort_view_inplace_raises():
    t = CTable(Row, new_data=DATA)
    view = t.where(t["id"] > 2)
    with pytest.raises(ValueError, match="view"):
        view.sort_by("id")
    with pytest.raises(ValueError, match="inplace"):
        view.sort_by("id", inplace=True)


def test_sort_view_copy_works():
    t = CTable(Row, new_data=DATA)
    view = t.where(t["id"] > 2)
    sorted_view = view.sort_by("id", ascending=False)
    ids = [sorted_view["id"][i] for i in range(len(sorted_view))]
    assert ids == sorted(ids, reverse=True)
