
Variable length columns for CTable#624

Open
FrancescAlted wants to merge 10 commits into main from ctable-varlen-cols

Conversation

@FrancescAlted
Member

This adds support for CTable columns that can host variable-length entities (typically large objects or lists of smaller objects).

It introduces a new ListArray object that centralizes variable-length handling.
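To make the idea concrete, here is a minimal plain-Python sketch of the classic offsets-plus-flat-values layout used for variable-length list columns (the same layout Arrow's ListArray uses). This is an illustration only; the class and method names below are invented for the example, and the actual blosc2.ListArray in this PR layers the idea over BatchArray/VLArray with compression and persistence.

```python
class TinyListColumn:
    """Illustrative list column: one flat values buffer plus an offsets array."""

    def __init__(self):
        self.values = []    # flattened contents of all cells
        self.offsets = [0]  # values[offsets[i]:offsets[i+1]] is cell i

    def append(self, cell):
        # Treat None as an empty list, mirroring typical list-column semantics.
        cell = [] if cell is None else list(cell)
        self.values.extend(cell)
        self.offsets.append(len(self.values))

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, i):
        return self.values[self.offsets[i]:self.offsets[i + 1]]


col = TinyListColumn()
col.append([1, 2])
col.append(None)
col.append([3, 4, 5])
# col[0] -> [1, 2]; col[1] -> []; col[2] -> [3, 4, 5]
```

The appeal of this layout is that cells of wildly different lengths share one contiguous buffer, so random access is two offset lookups plus a slice.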


Copilot AI left a comment


Pull request overview

Adds variable-length list-valued columns to CTable via a new ListSpec schema type and a new ListArray container that abstracts variable-length storage over BatchArray/VLArray, including persistence and Arrow round-tripping.

Changes:

  • Introduces ListSpec + blosc2.list(...) schema API and schema (de)serialization support for list columns.
  • Adds blosc2.ListArray and integrates it into CTable storage, mutation, selection, persistence, and Arrow import/export.
  • Adds SChunk.reorder_offsets() plus tests; updates docs/examples/benchmarks to reflect new capabilities and performance tooling.
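As a conceptual aside on reorder_offsets(): a super-chunk keeps an index of per-chunk offsets, so permuting that index reorders the logical chunk sequence without rewriting the chunk data itself. The sketch below shows only the permutation semantics in plain Python; `apply_reorder` is a hypothetical name for illustration, not the blosc2 API.

```python
def apply_reorder(chunks, new_order):
    """Return chunks in the logical order given by the permutation new_order.

    Only the index is permuted; the chunk payloads are not copied or rewritten.
    """
    if sorted(new_order) != list(range(len(chunks))):
        raise ValueError("new_order must be a permutation of the chunk indices")
    return [chunks[i] for i in new_order]


# apply_reorder([b"a", b"b", b"c"], [2, 0, 1]) -> [b"c", b"a", b"b"]
```

In the real SChunk, the cheap-permutation property is what makes offset reordering useful for operations like sorting, where rows can be regrouped without recompressing every chunk.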

Reviewed changes

Copilot reviewed 44 out of 44 changed files in this pull request and generated 4 comments.

File Description
tests/test_schunk_reorder_offsets.py New tests for SChunk.reorder_offsets() correctness and error handling.
tests/test_list_array.py New tests for ListArray append/extend/indexing/persistence/Arrow roundtrip.
tests/ctable/test_varlen_schema_compiler.py Tests for list spec building, schema compilation, and schema dict roundtrip.
tests/ctable/test_varlen_columns.py New integration tests for list columns in CTable (append/extend/where/select/compact/persistence/Arrow).
tests/ctable/test_sort_by.py Updates view sorting behavior tests (inplace vs copy).
src/blosc2/schunk.py Adds Python-level SChunk.reorder_offsets() and open-dispatch for ListArray.
src/blosc2/schema_vectorized.py Extends batch validation to validate list cells via coerce_list_cell.
src/blosc2/schema_validation.py Extends row/rows validation to normalize and type list columns for Pydantic validation.
src/blosc2/schema_compiler.py Adds ListSpec awareness (dtype optional), list annotation validation, and schema (de)serialization for list specs.
src/blosc2/schema.py Introduces ListSpec and public blosc2.list(...) builder.
src/blosc2/list_array.py New ListArray implementation over BatchArray/VLArray, including Arrow support and metadata tagging.
src/blosc2/dict_store.py Adds support for persisting/discovering external ListArray leaves as .b2b.
src/blosc2/ctable_storage.py Extends table storage to create/open list columns and improves index sidecar path handling for .b2z.
src/blosc2/ctable.py Integrates list columns into CTable core operations (append/extend/select/compact/save/load/to_arrow/sort/copy/info).
src/blosc2/core.py Adds from_cframe() dispatch for ListArray.
src/blosc2/blosc2_ext.pyx Adds Cython binding for SChunk.reorder_offsets().
src/blosc2/init.py Exposes ListArray and list builder in the public API.
plans/ctable-varlen-cols.md Detailed design/implementation plan for variable-length columns.
examples/ctable/varlen_columns.py Example demonstrating list columns and ListArray usage.
examples/ctable/index_on_b2z.py Example demonstrating index persistence across .b2z roundtrip.
doc/reference/list_array.rst New reference docs for ListArray.
doc/reference/ctable.rst Updates CTable docs to mention list columns and blosc2.list.
doc/reference/classes.rst Adds ListArray to the documented class/module lists.
bench/ctable/where_selective.py Uses perf_counter for timing.
bench/ctable/where_chain.py Uses perf_counter; replaces unsupported boolean expression usage in DSL.
bench/ctable/varlen.py New benchmark for varlen list columns across backends and access patterns.
bench/ctable/speed_iter.py Reworks row-iteration benchmark with sampling and perf_counter.
bench/ctable/sort_by.py New benchmark for sort_by() performance across scenarios.
bench/ctable/slice_to_array.py Updates benchmark to use slicing directly (no .to_numpy()).
bench/ctable/slice_steps.py Updates benchmark to use slicing directly (no .to_numpy()).
bench/ctable/slice.py Uses perf_counter for timing.
bench/ctable/row_access.py Uses perf_counter for timing.
bench/ctable/print.py Reworks benchmark to compare ingestion + rendering cost with pandas using perf_counter.
bench/ctable/iteration_column.py Updates benchmark to use slicing directly (no .to_numpy()).
bench/ctable/iter_rows.py Reworks iteration benchmark cases and uses perf_counter.
bench/ctable/indexin.py New benchmark comparing index kinds vs scan across selectivities and data layouts.
bench/ctable/indexin.md Captured benchmark output for index kinds comparison.
bench/ctable/extend_vs_append.py Reworks benchmark comparing append vs extend strategies with perf_counter.
bench/ctable/extend.py Uses perf_counter for timing.
bench/ctable/expected_size.py Uses perf_counter for timing.
bench/ctable/delete.py Uses perf_counter for timing.
bench/ctable/ctable_v_pandas.py Updates benchmark to use slicing directly (no .to_numpy()).
bench/ctable/compact.py Uses perf_counter for timing.
bench/ctable/bench_persistency.py Updates benchmark to use slicing directly (no .to_numpy()).


Comment thread src/blosc2/list_array.py
Comment on lines +289 to +294
    def extend(self, values: Iterable[Any], *, validate: bool = True) -> None:
        if validate:
            cells = [coerce_list_cell(self.spec, v) for v in values]
        else:
            cells = [v if v is not None else [] for v in values]
        if self.spec.storage == "vl":
Comment thread src/blosc2/ctable.py
Comment on lines 1408 to +1411
    def close(self) -> None:
        """Close any persistent backing store held by this table."""
        with contextlib.suppress(Exception):
            self._flush_varlen_columns()
Comment thread src/blosc2/ctable.py
Comment on lines +4806 to +4810
        if isinstance(data, dict):
            provided_names = set(data) & set(current_col_names)
            new_nrows = len(next(iter(data.values())))
            raw_columns = {name: data[name] for name in provided_names}
        elif isinstance(data, np.ndarray) and data.dtype.names is not None:
Comment on lines +224 to +236
def test_sort_view_inplace_raises():
    t = CTable(Row, new_data=DATA)
    view = t.where(t["id"] > 2)
    with pytest.raises(ValueError, match="view"):
        view.sort_by("id")
    with pytest.raises(ValueError, match="inplace"):
        view.sort_by("id", inplace=True)


def test_sort_view_copy_works():
    t = CTable(Row, new_data=DATA)
    view = t.where(t["id"] > 2)
    sorted_view = view.sort_by("id", ascending=False)
    ids = [sorted_view["id"][i] for i in range(len(sorted_view))]
    assert ids == sorted(ids, reverse=True)
