refactor/metadata package#3919

Open
d-v-b wants to merge 41 commits intozarr-developers:mainfrom
d-v-b:refactor/metadata-package

Contributor

@d-v-b d-v-b commented Apr 21, 2026

This PR creates a new subpackage called zarr-metadata just for JSON metadata. It's stored in packages/zarr_metadata. It contains TypedDict classes that model the spec-defined JSON forms for v2 array + group metadata and v3 array + group metadata, including data types, codecs, chunk key encodings, and chunk grids. I only included type definitions for metadata that has an external spec, so zarr-python will need to define some types internally, e.g. for unspecified data types or codecs.

I would like to publish this subpackage to PyPI. These types are useful to any Python tool that works with Zarr data, even if that tool doesn't use zarr-python. The subpackage is also useful to zarr-python, because it lets us resolve some lingering questions about publishing types.

If we adopted the changes here, adding a new data type / codec / chunk grid, etc., would require adding types to zarr-metadata, then adding the implementation in zarr-python that works with those types. We wouldn't need to do these 2 operations in the same PR, but I expect that would be the normal practice.

This change does add complexity to our publishing workflow: we need to ensure that zarr-metadata changes are published at or before new zarr-python releases. We should add some checks to ensure that this happens.

Docs for the new package are missing from this PR. I would handle that in a follow-up.

I would appreciate feedback at all levels, including the following topics:

  • are we aligned on the idea of a subpackage?
  • are the types correct (do they match the spec documents)?
  • what needs to change in our CI?

closes #3355 and #3795

d-v-b and others added 30 commits April 21, 2026 14:47
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t spec)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…unions

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…hBytesConfig, TimeConfig)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…om zarr-metadata

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…data

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ed NumcodecsConfig

Spec-defined metadata fields with fixed length and no mutation semantics
are typed as tuples, not Sequence. Applies to:
  - v2 ArrayMetadataV2.shape, .chunks
  - v2 DataTypeV2Structured.shape
  - v2 ArrayMetadataV2.filters (tuple of codec configs)
  - v3 RegularChunkGridConfig.chunk_shape
  - v3 RectilinearChunkGridConfig.chunk_shapes

Adds zarr_metadata.v2.codec.NumcodecsConfig, a TypedDict modeling the v2
spec shape for compressors and filters: a required 'id' field plus
arbitrary codec-specific extras. ArrayMetadataV2.compressor and .filters
now reference this type instead of an untyped Mapping[str, JSON].

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… migration

Three fixes:

1. Add missing "μs" unit to zarr_metadata.dtype.time.DateTimeUnit so it
   matches zarr-python's DateTimeUnit. zarr.core.dtype.npy.common.DateTimeUnit
   now re-exports from zarr-metadata (downstream consumers like
   zarr.core.dtype.npy.time pick it up transitively).

2. Replace `from X import Y as LegacyName` with `from X import Y` followed
   by a module-level `LegacyName: TypeAlias = Y` binding. mypy under
   `strict = true` rejected the renamed-import form under the explicit-
   re-export check ("Module 'X' does not explicitly export attribute 'Y'"),
   affecting 13 call sites across the codebase. The TypeAlias form makes
   the alias a proper type (mypy uses it in annotations) while preserving
   runtime introspection (`.__annotations__` access on the aliased TypedDict).

   Affects:
     - src/zarr/core/dtype/common.py    (DTypeJSON)
     - src/zarr/core/metadata/v2.py     (ArrayV2MetadataDict)
     - src/zarr/core/metadata/v3.py     (ArrayMetadataJSON_V3 + 5 others)

3. noqa: UP040 on the TypeAlias bindings. ruff prefers the `type` keyword
   (PEP 695), but that wraps the alias in a TypeAliasType which breaks
   `.__annotations__` lookup used by tests.

The 12 remaining "unused type: ignore" mypy errors in v3.py are
pre-existing (same count on the pre-refactor state) and unrelated to this
work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ycle

Moves JSON, NamedConfig, NamedRequiredConfig out of zarr_metadata/__init__.py
into zarr_metadata/common.py. Submodules (v2/*, v3/*) now import from
zarr_metadata.common directly, avoiding the circular import that occurred
when v2.codec was loaded during __init__.py execution.

Also:
  - v3.array declares RegularChunkGrid/RectilinearChunkGrid as direct
    TypedDict classes instead of NamedRequiredConfig aliases, simplifying
    the types and enabling more precise chunk-grid annotations downstream.
  - v2.consolidated.ConsolidatedMetadataV2.metadata value type widened to
    GroupMetadataV2 | ArrayMetadataV2 | JSON.
  - Added spec links to v2/{array,codec} docstrings.

zarr_metadata/__init__.py continues to re-export JSON, NamedConfig,
NamedRequiredConfig at the top level so zarr.core.common keeps resolving.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three issues surfaced by final code review:

1. Add py.typed marker to zarr-metadata. Without it, PEP 561 makes type
   checkers treat zarr-metadata as untyped, cascading into ~44 spurious
   mypy errors in zarr (subclassing Any, unused type: ignore, etc).

2. RegularChunkGrid.configuration was accidentally typed NotRequired when
   converted from NamedRequiredConfig to a direct TypedDict class. Per
   spec, chunk_shape is mandatory. Make configuration required.

3. RectilinearDimSpec was declared as tuples but zarr's compress_rle
   returned lists, and the to_dict producer built lists. Align producers
   with the declared type: compress_rle now returns list[int | tuple[int, int]],
   expand_rle accepts both list and tuple RLE pairs, to_dict builds tuples.

The tuple shape is correct per spec: each RLE pair is a JSON array of
exactly two elements (size, count) — a fixed-cardinality structure that
tuple models more faithfully than a mutable list.

Mypy error count now matches main (32) with these fixes in place.
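
A rough sketch of the RLE shape described in item 3, assuming that runs of length one are emitted as a bare integer and longer runs as a (size, count) pair. The function bodies below are illustrative only and are not the zarr-python implementations of `compress_rle` and `expand_rle`.

```python
from __future__ import annotations

from collections.abc import Sequence


def compress_rle(sizes: Sequence[int]) -> list[int | tuple[int, int]]:
    """Collapse runs of equal sizes into (size, count) tuples."""
    out: list[int | tuple[int, int]] = []
    i = 0
    while i < len(sizes):
        j = i
        # Advance j to the end of the current run of equal values.
        while j < len(sizes) and sizes[j] == sizes[i]:
            j += 1
        count = j - i
        # Assumption: a run of length one stays a bare integer.
        out.append(sizes[i] if count == 1 else (sizes[i], count))
        i = j
    return out


def expand_rle(spec: Sequence[int | Sequence[int]]) -> list[int]:
    """Expand bare sizes and (size, count) pairs, given as tuples or lists."""
    out: list[int] = []
    for item in spec:
        if isinstance(item, int):
            out.append(item)
        else:
            size, count = item  # works for both list and tuple pairs
            out.extend([size] * count)
    return out
```

Accepting any two-element sequence in `expand_rle` matters because JSON decoding yields lists, while the declared type uses tuples.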

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
consolidated_metadata is not in the core Zarr v3 spec as a field on group
metadata. It has an (unmerged) extension spec and is implemented by
zarr-python, but keeping it out of GroupMetadataV3 is the spec-faithful
move. The extra_items=AllowedExtraField on GroupMetadataV3 already
permits it to appear at runtime as an extension.

ConsolidatedMetadataV3 remains available at zarr_metadata.v3.consolidated
for consumers that want to type the extension shape.

Also fix two stray lint issues (missing trailing newline in common.py,
unused Mapping import in v2/array.py).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
zarr-metadata is a library, not an application — its lockfile pins
transitive dev versions that shouldn't be fixed in source. Untrack and
gitignore.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
d-v-b and others added 10 commits April 21, 2026 22:49
…nspose, sharding

Adds per-codec TypedDict configurations + name literals + full envelope
types for every core v3 codec besides blosc (which is extended in the
same style for consistency):

  - {Codec}CodecName       : Literal["<name>"]   — the spec "name" value
  - {Codec}CodecConfiguration  : TypedDict        — the "configuration" body
  - {Codec}Codec           : NamedRequiredConfig — the full envelope

crc32c has no configuration fields, so Crc32cCodec uses NamedConfig
(configuration optional) and no Configuration TypedDict is exported.

The `V1` suffix is dropped from the Configuration types (except blosc,
where V1 + Numcodecs disambiguate two concrete shapes). The other v3
codec specs aren't versioned at the codec level; there's only one shape
per codec today, and an incompatible future change would land under a
new codec name rather than a v2 of the same name.

Also fixes pre-existing v2 test fixtures to include the now-required
compressor/fill_value/order/filters fields on ArrayMetadataV2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Each {Codec}Codec envelope type is now an explicit TypedDict class with
`name` and `configuration` fields, rather than a NamedRequiredConfig[...]
generic alias. Readable at the call site, surfaces the spec structure
directly, and allows a real class-level docstring.

Also:
  - Drop BloscCodecConfigurationNumcodecs from zarr-metadata. numcodecs-
    shape modeling belongs in zarr-python (which implements that shape),
    not in zarr-metadata (which is spec-only).
  - Rename BloscCodecConfigurationV1 to BloscCodecConfiguration, matching
    the unversioned naming used for the other codecs.
  - Restore BloscConfigV2 locally in zarr-python for the numcodecs shape.
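
As a sketch of the explicit-class layout described above, a gzip codec could be typed roughly as follows. The names follow the {Codec}CodecName / {Codec}CodecConfiguration / {Codec}Codec pattern from these commit messages, but the exact definitions in the package are assumed here, not copied.

```python
from typing import Literal, TypedDict

# The spec-defined value of the "name" field.
GzipCodecName = Literal["gzip"]


class GzipCodecConfiguration(TypedDict):
    """The "configuration" body; the gzip codec spec defines a `level` field."""

    level: int


class GzipCodec(TypedDict):
    """JSON metadata for the gzip codec: a name plus its configuration."""

    name: GzipCodecName
    configuration: GzipCodecConfiguration


# A type checker accepts this literal as a GzipCodec; at runtime it is a plain dict.
codec: GzipCodec = {"name": "gzip", "configuration": {"level": 5}}
```

Writing the class out, rather than aliasing a generic, is what gives each codec a place for its own docstring.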

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…alued fields

Each codec now exports SCREAMING_CASE Final constants alongside the
Literal types. Downstream packages can reference the spec-defined
strings without retyping magic strings.

Codec names: BLOSC_CODEC_NAME, BYTES_CODEC_NAME, CRC32C_CODEC_NAME,
GZIP_CODEC_NAME, SHARDING_CODEC_NAME, TRANSPOSE_CODEC_NAME, ZSTD_CODEC_NAME.

Enum-valued field values:
  - Blosc: BLOSC_SHUFFLE_{NOSHUFFLE,SHUFFLE,BITSHUFFLE},
    BLOSC_CNAME_{LZ4,LZ4HC,BLOSCLZ,SNAPPY,ZLIB,ZSTD}
  - Bytes: BYTES_ENDIAN_{LITTLE,BIG} (also extracts the existing
    Literal into a new `Endian` alias)
  - Sharding: SHARDING_INDEX_LOCATION_{START,END} (and `IndexLocation`
    Literal alias)
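
A minimal sketch of how such constants can sit next to a Literal alias, using the bytes codec names listed above. The definitions are assumed for illustration and may not match the package exactly.

```python
from typing import Final, Literal, get_args

# Literal alias for the `endian` field of the bytes codec.
Endian = Literal["little", "big"]

# Final constants so downstream code need not retype the strings.
BYTES_CODEC_NAME: Final = "bytes"
BYTES_ENDIAN_LITTLE: Final = "little"
BYTES_ENDIAN_BIG: Final = "big"

# A cheap consistency check between the constants and the Literal members.
assert {BYTES_ENDIAN_LITTLE, BYTES_ENDIAN_BIG} == set(get_args(Endian))
```

A check like the final assertion could live in the test suite to catch a constant drifting away from its Literal type.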

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rewords docstrings and test names throughout the package: the {Codec}Codec
TypedDict describes a codec's JSON metadata, not a "named-config envelope."
Less jargon, consistent with the package name. Identifier names are
unchanged (still BloscCodec, GzipCodec, etc.).

Also renames v3/array.py chunk-grid docstrings for consistency
(Regular/Rectilinear ChunkGrid "metadata" rather than "named-config
container"), and updates the README.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Convert all double-backtick RST-style inline code in zarr-metadata
docstrings to single-backtick markdown style. The package's
documentation will be rendered by mkdocs, which expects markdown, so
single backticks render correctly as inline code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Models the spec-defined v3 data types from zarr-specs core and
zarr-extensions:

  * `dtype/primitive.py` (NEW) - Final constants and `PrimitiveDTypeName`
    Literal union for all 14 core v3 primitives (bool, int8..int64,
    uint8..uint64, float16..float64, complex64, complex128).

  * `dtype/bytes.py` - adds `BYTES_DTYPE_NAME` and `BytesDTypeName` for
    the variable-length `bytes` extension; adds `NullTerminatedBytes`
    envelope TypedDict for `null_terminated_bytes` (zarr-extensions).
    Retains `FixedLengthBytesConfig` (re-exported by zarr-python).

  * `dtype/string.py` - adds `STRING_DTYPE_NAME`/`StringDTypeName` for
    the `string` extension; adds `FixedLengthUtf32` envelope. Retains
    `LengthBytesConfig`.

  * `dtype/time.py` - adds `NumpyDatetime64` and `NumpyTimedelta64`
    envelopes plus name constants/literals. The shared `TimeConfig` body
    is preserved.

  * `dtype/struct.py` (NEW) - the `struct` extension type, with
    `StructField`, `StructConfig`, and `Struct` envelope. Fields hold
    recursive `DType` values, supporting nested structs.

The `r<N>` raw-bytes type from the core spec is parameterised on bit
count, not a single literal name, so it isn't given a TypedDict; consumers
match it against the wider `DType` alias.

Tests updated and extended for the new types and constants.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ators

Restructure `zarr_metadata.dtype.*` so each spec data type lives in its
own module, mirroring the per-codec layout in `zarr_metadata.codec.*`
and the per-dtype directory layout in zarr-extensions.

New per-type modules (one per spec data type):
  bool.py, int8/16/32/64.py, uint8/16/32/64.py,
  float16/32/64.py, complex64/128.py,
  bytes.py, string.py, numpy_datetime64.py, numpy_timedelta64.py,
  struct.py, raw.py

Each module exports:
  - {DTYPE}_DTYPE_NAME (Final str)
  - {DType}DTypeName (Literal)
  - For envelope types: a {DType} TypedDict + a {DType}Configuration
  - {DType}FillValue alias for the JSON shape of `fill_value`

Removed `null_terminated_bytes` and `fixed_length_utf32` from
zarr-metadata: they are not in zarr-specs or zarr-extensions; they are
zarr-python-specific. Their `LengthBytesConfig` and
`FixedLengthBytesConfig` TypedDicts now live locally in zarr-python at
src/zarr/core/dtype/npy/{string,bytes}.py.

zarr.core.dtype.npy.common now imports `DateTimeUnit` from
`zarr_metadata.dtype.numpy_datetime64`. zarr.core.dtype.npy.time imports
`TimeConfig` (aliased from `NumpyDatetime64Configuration`).

NewType + validating-constructor pattern for non-literal spec strings:
  - HexFloat{16,32,64} for the float hex-string fill values
  - Base64Bytes for the `bytes` base64 fill value
  - RawBytesDTypeName for the `r<N>` parameterised name

These make spec-format constraints visible to the type system; the
matching validating constructors (e.g. `hex_float32`) are the only
runtime logic in the package and are minimal regex checks.
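
The NewType + validating-constructor pattern could look roughly like this for a float32 hex fill value. The regex (`0x` followed by exactly eight hex digits) is an assumption made for this sketch and is not taken from the package.

```python
import re
from typing import NewType

# A str subtype for the type checker; at runtime it is still a plain str.
HexFloat32 = NewType("HexFloat32", str)

# Assumed format: "0x" followed by 8 hex digits (32 bits).
_HEX_FLOAT32 = re.compile(r"0x[0-9a-fA-F]{8}")


def hex_float32(value: str) -> HexFloat32:
    """Return `value` as a HexFloat32, or raise ValueError if malformed."""
    if _HEX_FLOAT32.fullmatch(value) is None:
        raise ValueError(f"not a float32 hex string: {value!r}")
    return HexFloat32(value)


nan_fill = hex_float32("0x7fc00000")
```

Under this pattern, code that requires a `HexFloat32` cannot be handed an arbitrary `str` without a type-checker error, so the constructor becomes the single point of validation.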

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…+ chunk_key_encoding

Move chunk-grid TypedDicts out of `v3/array.py` into per-type modules,
mirroring the per-codec and per-dtype layouts:

  packages/zarr-metadata/src/zarr_metadata/v3/
  ├── chunk_grid/
  │   ├── __init__.py
  │   ├── regular.py        # core spec
  │   └── rectilinear.py    # zarr-extensions
  └── chunk_key_encoding/
      ├── __init__.py       # ChunkKeySeparator alias
      ├── default.py        # core spec
      └── v2.py             # core spec

Each module exports:
  - {NAME}_NAME (Final str)
  - {Name} (TypedDict envelope)
  - {Name}Configuration (TypedDict body)
  - {Name}Name (Literal type of the `name` field)

`v3/array.py` shrinks to just `AllowedExtraField`, `MetadataField`, and
`ArrayMetadataV3`. `chunk_grid` and `chunk_key_encoding` fields stay
typed as `MetadataField` (str | NamedConfig) -- narrowing them to a
specific union belongs in a future validation layer, not in the
spec-faithful types layer.
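
A sketch of what one per-type module might export under this layout, using the regular chunk grid. The constant name `REGULAR_CHUNK_GRID_NAME` and the exact field types are assumptions for illustration; only the export pattern comes from the list above.

```python
from typing import Final, Literal, TypedDict

# Hypothetical constant name following the {NAME}_NAME pattern.
REGULAR_CHUNK_GRID_NAME: Final = "regular"

# Literal type of the `name` field.
RegularChunkGridName = Literal["regular"]


class RegularChunkGridConfiguration(TypedDict):
    """Body of the chunk grid; `chunk_shape` is mandatory per the spec."""

    chunk_shape: tuple[int, ...]


class RegularChunkGrid(TypedDict):
    """JSON metadata for the regular chunk grid."""

    name: RegularChunkGridName
    configuration: RegularChunkGridConfiguration


grid: RegularChunkGrid = {
    "name": REGULAR_CHUNK_GRID_NAME,
    "configuration": {"chunk_shape": (10, 10)},
}
```

The tuple annotation is a static-typing choice: JSON decoding produces lists, so a consumer would convert or cast when loading real metadata.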

Configuration TypedDicts renamed from `*Config` to `*Configuration`
to match the dtype/codec naming. zarr.core.metadata.v3 re-exports
preserve the legacy `*Config` aliases via `as` imports.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both directories model v3-spec artifacts, so they belong under the
v3/ subpackage alongside v3/array, v3/group, v3/consolidated,
v3/chunk_grid, and v3/chunk_key_encoding.

The principle is now: anything imported from `zarr_metadata.v3.X` is a
v3-spec artifact; anything from `zarr_metadata.v2.X` is a v2-spec
artifact; only true cross-version primitives sit at the top level
(`zarr_metadata.JSON`, `NamedConfig`, `NamedRequiredConfig`, and the
`ArrayMetadata`/`GroupMetadata` unions).

Path moves:
  zarr_metadata.codec.*  -> zarr_metadata.v3.codec.*
  zarr_metadata.dtype.*  -> zarr_metadata.v3.dtype.*

Internal imports inside the moved modules and zarr-python re-export
sites updated accordingly. zarr.abc.codec imports the zarr-metadata
Codec alias with a private name to avoid colliding with its own
runtime `Codec` union (`ArrayArrayCodec | ArrayBytesCodec |
BytesBytesCodec`), then re-exports as `CodecJSON`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Matches the v3 spec field name `data_type` exactly. All imports inside
the package and in zarr-python re-export sites updated accordingly.

The `DType` type alias keeps its short name (it's the widely understood
abbreviation for "data type JSON shape"); only the module path changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-actions bot added the "needs release notes" label (automatically applied to PRs which haven't added release notes) Apr 21, 2026