Experimental Support for Subarray DTypes#3587
Experimental Support for Subarray DTypes#3587sehoffmann wants to merge 6 commits intozarr-developers:mainfrom
Conversation
… for nested Structured dtypes in V2 (zarr-developers#3582, zarr-developers#3583)
221cc75 to
c21fcc6
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3587 +/- ##
==========================================
+ Coverage 93.10% 93.12% +0.02%
==========================================
Files 85 86 +1
Lines 11193 11334 +141
==========================================
+ Hits 10421 10555 +134
- Misses 772 779 +7
🚀 New features to boost your workflow:
|
|
@d-v-b Don't want to be pushy here, but did you manage to have a look at this PR yet? Do you have any feedback or is there anything that needs to be changed or addressed? |
Hi @sehoffmann sorry for the long silence. I think there are 2 distinct elements in this PR: first is improving how we handle numpy structured dtypes, and the second is including sub-array data types. The first element looks great, but I have some concerns about the second element. So far we have tried to keep the set of supported data types as close as possible to the union of the data types zarr python v2 supported, plus the data types supported by other zarr v3 implementations (namely, zarrs and tensorstore). This means when we add a new data type, there are two questions to answer: is this dtype something people used in zarr python 2, (and if does adding it resolve a feature regression)? or, is this dtype something the other zarr v3 implementations are supporting? If the answer to both of those is "no", then it seems like the maintenance burden for zarr-python might not be worth it, compared to the alternative of users registering this data type themselves via the registry. And I think sub-arrays are not something people used heavily in zarr python 2.x, nor are they supported by other zarr v3 implementations (please correct me if I'm wrong on either of these points). How important is it for your application that this data type is bundled with Zarr python? And if that outcome is very important, would you be willing to work on a data type spec in the zarr-extensions repo? I think I'd support adding the new subarray data type unreservedly if there was buy-in from other zarr implementers. Without that buy-in, I'm pretty skeptical about the addition, and I would encourage using the data type registry to register the data type instead of relying on it being shipped wit zarr-python. these are just my thoughts though, it would be good to hear from the other devs @zarr-developers/python-core-devs |
|
Following the implementation of the zarr-extension for structured dtypes, I've rebased this branch to the latest main and kept only the changes related to subarrays inside structured dtypes. See sehoffmann#1 |
|
Hey @d-v-b, just to follow up on this. This PR would be ready to merge if it would add support for subarray dtypes as part of |
|
hi @sehoffmann we are still missing a spec for an "array of scalars" dtype. This should probably be a generic fixed-length data type that's configured with its length and the type of its contents. Once we have a spec for that, then it composes with the new struct dtype. |
|
Hey @d-v-b, Just supporting subarrays as part of I.e. If that is fine with you, I could proceed by adapting this PR such that subarrays are only supported as part of |
|
hi @sehoffmann, we now have a spec for a language-agnostic struct dtype, and this was added to zarr-python recently. The spec defines the struct dtype as generic over other dtypes. So if we want the struct dtype to support subarrays in a way that could work for implementations outside of zarr-python, the best way forward would be a subarray dtype spec in zarr-extensions, then the struct dtype would "just work" without any changes. I opened an issue about this in zarr-extensions: zarr-developers/zarr-extensions#57. I don't think the spec for a subarray dtype would be too much work (but I don't have time for it at the moment). |
|
I see, thanks for the clarification. How do you want to handle the backward compatibility to zarr v2? If I am not mistaken, subarrays were an integral part of |
@d-v-b
This PR adds experimental support for subarray dtypes (https://numpy.org/doc/stable/glossary.html#term-subarray-data-type, https://numpy.org/doc/stable/user/basics.rec.html#structured-datatype-creation) and closes #3582 and #3583.
It also fixes support for nested (and subarray-containing)
Structureddtypes for Zarr v2 which worked before in 2.18.* but not anymore 3.1.*. In particular, the buggy implementation forgot that a nested structured dtype is again a list of lists and not just a single flat list.Note 1:
Subarray dtypes are in a very weird spot. They are a proper
np.dtype, particular anp.VoidDTypewith unsetfieldsattribute but setsubdtypefield. Hence, it makes sense to map them one-to-one to aZDType. This also makes sense from an implementation standpoint wrt. serialization.On the other hand, they do not have a proper scalar value. I.e. one can not create a
np.voidscalar for a subarray dtype (throws). Conceptually, a scalar value of a subarray dtype would be anp.ndarray. This, however, is not a subtype ofnp.genericdespite sharing a lot of the interface. When one creates a np.ndarray with a subarray dtype directly, the result is "flat"np.ndarraywith shapearray_shape + subarray_shape.I've decided to still implement them as separate
Subarray-ZDType and not conflate them within theStructuredclass. While this works flawlessly when used within a structured dtype, the intended use case, using them directly is not fully supported. Specifically, there is no specification for standalone subarray dtypes in Zarr V2, making a lot of test cases fail. Apart from that, some tests in test_array.py do not expect an array as scalar and hence fail. I want to stress though, that I was able to successfully create and read a Subarray zarr array with V3.Solving this conundrum adequately is beyond my possibilities and might require significant conceptual changes in Zarr. I did not add the dtype directly to
test_dtype/contest.pybut instead added a new test case forStructuredthat uses a Subarray inside which passes.Note 2: I've also added a test case for an invalid float value string which fails due to #3584. Since that test case highlights an existing bug, I've decided to leave it there.
TODO:
docs/user-guide/*.mdchanges/