Experimental Support for Subarray DTypes#3587

Open
sehoffmann wants to merge 6 commits into zarr-developers:main from sehoffmann:subarray_dtypes
@sehoffmann

@d-v-b

This PR adds experimental support for subarray dtypes (https://numpy.org/doc/stable/glossary.html#term-subarray-data-type, https://numpy.org/doc/stable/user/basics.rec.html#structured-datatype-creation) and closes #3582 and #3583.

It also fixes support for nested (and subarray-containing) Structured dtypes in Zarr v2, which worked in 2.18.* but no longer works in 3.1.*. In particular, the buggy implementation overlooked that a nested structured dtype is itself a list of lists rather than a single flat list.
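To illustrate the "list of lists" point, here is a minimal sketch using only NumPy and the standard library. The helper names `descr_to_json` and `json_to_descr` are hypothetical and are not the functions in zarr-python; the sketch only shows why a nested structured dtype needs a recursive conversion instead of a flat one.

```python
import json

import numpy as np


def descr_to_json(descr):
    """Recursively turn a NumPy dtype descr into JSON-friendly nested lists."""
    out = []
    for field in descr:
        name, base, *rest = field
        if isinstance(base, list):
            # A nested structured field: its base is again a list of fields.
            base = descr_to_json(base)
        entry = [name, base]
        if rest:
            # A subarray field carries its shape as a third element.
            entry.append(list(rest[0]))
        out.append(entry)
    return out


def json_to_descr(obj):
    """Inverse of descr_to_json: rebuild the list of tuples NumPy expects."""
    descr = []
    for name, base, *rest in obj:
        if isinstance(base, list):
            base = json_to_descr(base)
        shape = (tuple(rest[0]),) if rest else ()
        descr.append((name, base) + shape)
    return descr


dt = np.dtype(
    [
        ("a", "<i4"),
        ("b", [("x", "<f4"), ("y", "<f4")]),  # nested structured field
        ("z", "<f4", (2, 2)),                 # subarray field
    ]
)

encoded = descr_to_json(dt.descr)
restored = np.dtype(json_to_descr(json.loads(json.dumps(encoded))))
print(encoded)
print(restored == dt)
```

A converter that treats every field as a flat `[name, type]` pair would lose both the inner field list of `b` and the shape of `z`.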

Note 1:
Subarray dtypes are in a very weird spot. They are a proper np.dtype, in particular an np.dtypes.VoidDType with an unset fields attribute but a set subdtype attribute. Hence, it makes sense to map them one-to-one to a ZDType. This also makes sense from an implementation standpoint with respect to serialization.

On the other hand, they do not have a proper scalar value, i.e. one cannot create an np.void scalar for a subarray dtype (it throws). Conceptually, a scalar value of a subarray dtype would be an np.ndarray. This, however, is not a subtype of np.generic despite sharing much of its interface. When one creates an np.ndarray with a subarray dtype directly, the result is a "flat" np.ndarray with shape array_shape + subarray_shape.
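The behaviour described in this note can be checked with plain NumPy. The snippet below is a small sketch (variable names are illustrative) showing the unset `fields`, the set `subdtype`, and the "flattening" that happens when an array is created with a standalone subarray dtype.

```python
import numpy as np

# A standalone subarray dtype: 2x2 block of float32.
sub = np.dtype(("f4", (2, 2)))

print(sub.kind)      # 'V' (void kind)
print(sub.fields)    # None, i.e. not a structured dtype
print(sub.subdtype)  # (dtype('float32'), (2, 2))
print(sub.itemsize)  # 16 bytes = 4 elements * 4 bytes

# Creating an array with this dtype absorbs the subarray shape into the
# array shape; the resulting array has the base dtype.
arr = np.zeros(3, dtype=sub)
print(arr.shape)  # (3, 2, 2)
print(arr.dtype)  # float32
```

Because the subarray shape is folded into the array shape, the array itself no longer records that it was built from a subarray dtype.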

I've decided to still implement them as a separate Subarray ZDType and not conflate them with the Structured class. While this works flawlessly when used within a structured dtype (the intended use case), using them directly is not fully supported. Specifically, there is no specification for standalone subarray dtypes in Zarr V2, which makes a lot of test cases fail. Apart from that, some tests in test_array.py do not expect an array as a scalar and hence fail. I want to stress, though, that I was able to successfully create and read a Subarray zarr array with V3.

Solving this conundrum adequately is beyond what I can do here and might require significant conceptual changes in Zarr. I did not add the dtype directly to test_dtype/conftest.py but instead added a new test case for Structured that uses a Subarray inside, which passes.

Note 2: I've also added a test case for an invalid float value string which fails due to #3584. Since that test case highlights an existing bug, I've decided to leave it there.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.md
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions bot added the "needs release notes" label (automatically applied to PRs which haven't added release notes) on Nov 20, 2025
@codecov

codecov Bot commented Nov 21, 2025

Codecov Report

❌ Patch coverage is 93.42105% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.12%. Comparing base (dd5a321) to head (d4cb2d8).
⚠️ Report is 3 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/zarr/core/dtype/npy/structured.py | 84.00% | 4 Missing ⚠️ |
| src/zarr/core/dtype/common.py | 78.57% | 3 Missing ⚠️ |
| src/zarr/core/dtype/npy/subarray.py | 97.29% | 3 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3587      +/-   ##
==========================================
+ Coverage   93.10%   93.12%   +0.02%     
==========================================
  Files          85       86       +1     
  Lines       11193    11334     +141     
==========================================
+ Hits        10421    10555     +134     
- Misses        772      779       +7     

| Files with missing lines | Coverage Δ |
|---|---|
| src/zarr/core/dtype/__init__.py | 100.00% <100.00%> (ø) |
| src/zarr/core/dtype/npy/bytes.py | 99.50% <100.00%> (ø) |
| src/zarr/core/dtype/common.py | 86.36% <78.57%> (+1.17%) ⬆️ |
| src/zarr/core/dtype/npy/subarray.py | 97.29% <97.29%> (ø) |
| src/zarr/core/dtype/npy/structured.py | 95.68% <84.00%> (-3.25%) ⬇️ |

@sehoffmann
Author

@d-v-b Don't want to be pushy here, but did you manage to have a look at this PR yet? Do you have any feedback or is there anything that needs to be changed or addressed?

@d-v-b
Contributor

d-v-b commented Dec 19, 2025

> @d-v-b Don't want to be pushy here, but did you manage to have a look at this PR yet? Do you have any feedback or is there anything that needs to be changed or addressed?

Hi @sehoffmann sorry for the long silence. I think there are 2 distinct elements in this PR: first is improving how we handle numpy structured dtypes, and the second is including sub-array data types.

The first element looks great, but I have some concerns about the second element. So far we have tried to keep the set of supported data types as close as possible to the union of the data types zarr python v2 supported, plus the data types supported by other zarr v3 implementations (namely, zarrs and tensorstore).

This means when we add a new data type, there are two questions to answer: is this dtype something people used in zarr python 2 (and if so, does adding it resolve a feature regression)? Or, is this dtype something the other zarr v3 implementations are supporting? If the answer to both of those is "no", then it seems like the maintenance burden for zarr-python might not be worth it, compared to the alternative of users registering this data type themselves via the registry. And I think sub-arrays are not something people used heavily in zarr python 2.x, nor are they supported by other zarr v3 implementations (please correct me if I'm wrong on either of these points).

How important is it for your application that this data type is bundled with Zarr python? And if that outcome is very important, would you be willing to work on a data type spec in the zarr-extensions repo? I think I'd support adding the new subarray data type unreservedly if there was buy-in from other zarr implementers. Without that buy-in, I'm pretty skeptical about the addition, and I would encourage using the data type registry to register the data type instead of relying on it being shipped with zarr-python.

these are just my thoughts though, it would be good to hear from the other devs @zarr-developers/python-core-devs

@vitusbenson

Following the implementation of the zarr-extension for structured dtypes, I've rebased this branch to the latest main and kept only the changes related to subarrays inside structured dtypes.

See sehoffmann#1
And https://github.com/vitusbenson/zarr-python/tree/subarray_dtypes

@sehoffmann
Author

Hey @d-v-b,

just to follow up on this: would this PR be ready to merge if it added support for subarray dtypes only as part of Structured (not standalone), following the spec from zarr-developers/zarr-extensions#45?

@d-v-b
Contributor

d-v-b commented Apr 21, 2026

hi @sehoffmann we are still missing a spec for an "array of scalars" dtype. This should probably be a generic fixed-length data type that's configured with its length and the type of its contents. Once we have a spec for that, then it composes with the new struct dtype.

@sehoffmann
Author

Hey @d-v-b,

Just supporting subarrays as part of Structured would also be completely fine from my side. I see this as a matter of taste. Given that standalone subarray dtypes behave rather unexpectedly and weirdly compared with other dtypes, there might actually be a good reason not to allow standalone subarrays.

I.e. np.dtype([('x', 'f4'), ('y', np.float32), ('z', 'f4', (2, 2))]) should still be supported by zarr as part of Structured (as already in v2), but it is completely ok for me if np.dtype(('f4', (2,2))) doesn't have a corresponding zarr datatype. Maybe this was a misunderstanding before.
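The two cases contrasted here can be compared side by side in plain NumPy. This is only an illustrative sketch (variable names are made up) and says nothing about how zarr would store either dtype.

```python
import numpy as np

# Subarray used as a field of a structured dtype (the case to keep supporting).
structured = np.dtype([("x", "f4"), ("y", np.float32), ("z", "f4", (2, 2))])
# Standalone subarray dtype (the case that may stay unsupported).
standalone = np.dtype(("f4", (2, 2)))

records = np.zeros(3, dtype=structured)
print(records.shape)             # (3,)  the record layout is preserved
print(records.dtype.names)       # ('x', 'y', 'z')
print(records["z"].shape)        # (3, 2, 2)  subarray appears on field access
print(structured["z"].subdtype)  # (dtype('float32'), (2, 2))
print(structured.itemsize)       # 24 = 4 + 4 + 16 bytes

plain = np.zeros(3, dtype=standalone)
print(plain.shape, plain.dtype)  # (3, 2, 2) float32  subarray dtype is gone
```

Inside a structured dtype the subarray shape stays attached to the field, whereas the standalone form dissolves into an ordinary float32 array, which is one way to see why the standalone case is the awkward one.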


If that is fine with you, I could proceed by adapting this PR such that subarrays are only supported as part of Structured.

@d-v-b
Contributor

d-v-b commented Apr 23, 2026

hi @sehoffmann, we now have a spec for a language-agnostic struct dtype, and this was added to zarr-python recently. The spec defines the struct dtype as generic over other dtypes.

So if we want the struct dtype to support subarrays in a way that could work for implementations outside of zarr-python, the best way forward would be a subarray dtype spec in zarr-extensions, then the struct dtype would "just work" without any changes.

I opened an issue about this in zarr-extensions: zarr-developers/zarr-extensions#57. I don't think the spec for a subarray dtype would be too much work (but I don't have time for it at the moment).

@sehoffmann
Author

I see, thanks for the clarification. How do you want to handle backward compatibility with zarr v2? If I am not mistaken, subarrays were an integral part of Structured there and not standalone. Is this an issue, and does the new subarray dtype need to concern itself with this?

Labels

needs release notes Automatically applied to PRs which haven't added release notes

Development

Successfully merging this pull request may close these issues.

Subarray dtypes get lost on serialization / casted to void type

3 participants