Epic: Extension Types #7683

@connortsui20

This Epic is for adding robust support for extension types in Vortex. An extension type is a logical type that is not "built-in" to the enum DType. For example, a Vector type would be an extension type, whereas Primitive is built-in to the DType enum.
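
To make the built-in versus extension distinction concrete, here is a minimal sketch. It is a toy model, not the real Vortex `DType`: the variant fields, the `ExtDType` struct, the `storage_dtype` and `is_extension` methods, the `vector3_f32` helper, and the `example.vector` id are all assumptions made for illustration.

```rust
// Toy model only: these are NOT the real Vortex definitions.

/// A simplified stand-in for an enum of built-in logical types.
#[derive(Debug, Clone, PartialEq)]
pub enum DType {
    /// A built-in primitive type, reduced here to its bit width.
    Primitive { bits: u8 },
    /// A built-in fixed-size list of `len` elements.
    FixedSizeList { element: Box<DType>, len: u32 },
    /// The single escape hatch: every extension type goes through this variant.
    Extension(ExtDType),
}

/// An extension type is identified by an id and wraps a storage dtype.
#[derive(Debug, Clone, PartialEq)]
pub struct ExtDType {
    pub id: String,
    pub storage: Box<DType>,
}

impl DType {
    /// Peel off any extension wrappers to reach the underlying built-in type.
    pub fn storage_dtype(&self) -> &DType {
        match self {
            DType::Extension(ext) => ext.storage.storage_dtype(),
            other => other,
        }
    }

    /// True only for types that are not built-in variants of the enum.
    pub fn is_extension(&self) -> bool {
        matches!(self, DType::Extension(_))
    }
}

/// Hypothetical helper: a 3-element f32 vector as an extension over FixedSizeList.
pub fn vector3_f32() -> DType {
    DType::Extension(ExtDType {
        id: "example.vector".to_string(),
        storage: Box::new(DType::FixedSizeList {
            element: Box::new(DType::Primitive { bits: 32 }),
            len: 3,
        }),
    })
}
```

In this model, adding a new logical type means constructing a new `ExtDType` value rather than adding a variant to the enum, so existing `match` statements over `DType` do not need to change.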

This is a follow-up to some older issues and documents:

Status

Active.

We are in the process of adding different extension types (like UUID, Vectors, Tensors, and potentially Refinement types) to the Vortex main repository, which helps inform us of the best way to support all possible extension types that people might want to add in the future.

Goal

Users should be able to easily extend the Vortex type system with their own type constructs that nicely fit their application model and use case. Note that this Epic will link to different first-class extension type implementations for visibility, but the purpose of this Epic is to track the underlying systems that support creating and utilizing extension types in Vortex.

Motivation

A limitation of the current type system in Vortex is that we cannot easily add new logical types. For example, the effort to add FixedSizeList (vortex#4372) and also change List to ListView (vortex#4699) was very intrusive. It is much easier to add wrappers around canonical types (treating the canonical dtype as a "storage type") and implement some additional logic than to add a new variant to the DType enum.

We would like to add many more extension types. Some notable extension types include:

  • Time: We already have basic support for some different notions of Time in Vortex, but this can be improved.
  • Vector / Matrix / FixedShapeTensor: This would be an extension over FixedSizeList, where dimensions correspond to levels of nesting.
  • VariableShapeTensor: This would help support storing variable-size and variable-length videos in Vortex.
  • Uuid: Since a UUID is a 128-bit (16-byte) value, we likely want to add a FixedSizeBinary type to serve as its storage type.
  • Geospatial: This would be similar to the Arrow canonical extension type: https://github.com/geoarrow/geoarrow.
  • Json: TODO
  • Images (are PNG and JPG logical types?)
  • Video (are different codecs logical types?)
  • ???
  • Refinement Types: More on this in the Type System RFC (TODO).
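
As a rough illustration of the tensor item above ("dimensions correspond to levels of nesting"), the sketch below derives a nested fixed-size-list storage type from a tensor shape. The `Storage` enum and the three functions are hypothetical names, not Vortex APIs. It also assumes an `f32` element type and a row-major layout in which the last dimension is the innermost list.

```rust
// Toy model only: `Storage` is a stand-in, not the real Vortex DType.
#[derive(Debug, Clone, PartialEq)]
pub enum Storage {
    F32,
    /// A fixed-size list of `usize` elements of the inner type.
    FixedSizeList(Box<Storage>, usize),
}

/// Build the nested storage type for a fixed-shape tensor of f32 values.
/// Assumption: row-major layout, so the last dimension is the innermost list.
pub fn tensor_storage(shape: &[usize]) -> Storage {
    shape
        .iter()
        .rev()
        .fold(Storage::F32, |inner, &dim| {
            Storage::FixedSizeList(Box::new(inner), dim)
        })
}

/// Number of nesting levels, which equals the number of tensor dimensions.
pub fn nesting_depth(storage: &Storage) -> usize {
    match storage {
        Storage::F32 => 0,
        Storage::FixedSizeList(inner, _) => 1 + nesting_depth(inner),
    }
}

/// Total number of scalar elements stored per tensor value.
pub fn element_count(storage: &Storage) -> usize {
    match storage {
        Storage::F32 => 1,
        Storage::FixedSizeList(inner, len) => *len * element_count(inner),
    }
}
```

Under these assumptions, a 2x3 matrix becomes a list of 2 lists of 3 floats: two levels of nesting and six scalars per value.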

We would like the Vortex extension type system to be expressive enough that anyone can add their own extension type by simply providing an implementation of the ExtVTable.
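
A hedged sketch of what "providing an implementation" could look like from a user's point of view. The trait below is deliberately named `ToyExtVTable` because it is not the actual Vortex `ExtVTable`. Its methods, the `StorageKind` enum, the `Registry`, and the `Timestamp` example with its `I64` storage are all assumptions made for illustration.

```rust
use std::collections::HashMap;

// Hypothetical sketch: this trait is NOT Vortex's actual ExtVTable.

/// Toy storage types that an extension may wrap.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum StorageKind {
    I64,
    FixedSizeBytes(usize),
}

/// The behavior a user would supply to define a new extension type.
pub trait ToyExtVTable {
    /// Unique identifier for the extension type.
    fn id(&self) -> &'static str;
    /// Check that the proposed storage type is acceptable.
    fn validate_storage(&self, storage: StorageKind) -> Result<(), String>;
}

/// Example user-defined extension: a timestamp stored as a 64-bit integer.
pub struct Timestamp;

impl ToyExtVTable for Timestamp {
    fn id(&self) -> &'static str {
        "example.timestamp"
    }

    fn validate_storage(&self, storage: StorageKind) -> Result<(), String> {
        match storage {
            StorageKind::I64 => Ok(()),
            other => Err(format!("timestamp requires I64 storage, got {:?}", other)),
        }
    }
}

/// A registry mapping extension ids to their vtables.
#[derive(Default)]
pub struct Registry {
    vtables: HashMap<&'static str, Box<dyn ToyExtVTable>>,
}

impl Registry {
    /// Register a vtable under the id it reports.
    pub fn register(&mut self, vtable: Box<dyn ToyExtVTable>) {
        self.vtables.insert(vtable.id(), vtable);
    }

    /// Look up the vtable by id and delegate storage validation to it.
    pub fn validate(&self, id: &str, storage: StorageKind) -> Result<(), String> {
        let vtable = self
            .vtables
            .get(id)
            .ok_or_else(|| format!("unknown extension type: {}", id))?;
        vtable.validate_storage(storage)
    }
}
```

The point of the design is that the host system only ever talks to the trait object, so a new extension type needs no changes outside its own `impl` block and a single `register` call.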

Additionally, we want extension types to be just as performant as the native logical types in Vortex (like Primitive or List), and have access to all Vortex features like expression evaluation, late materialization, compression, etc.

Unresolved questions

  • TODO (there are definitely unresolved questions, need to fill this out).
  • Extension validation logic (of storage scalars, of the storage array, etc).
  • Refinement types???
  • Should the canonical extension type imply that the storage array is canonical? Right now, canonicalizing an ExtensionArray is a no-op.
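
To illustrate the validation question above, the sketch below contrasts checking a single storage scalar with checking a whole storage array, using a UUID as the example. The function names, the flat byte-buffer layout, and the error messages are assumptions for illustration and do not reflect how Vortex validates extension storage.

```rust
// Hypothetical sketch: names and layout are illustrative only.

/// A UUID is a 128-bit value, i.e. exactly 16 bytes of storage.
pub const UUID_BYTES: usize = 16;

/// Validate a single storage scalar.
pub fn validate_uuid_scalar(bytes: &[u8]) -> Result<(), String> {
    if bytes.len() == UUID_BYTES {
        Ok(())
    } else {
        Err(format!("expected {} bytes, got {}", UUID_BYTES, bytes.len()))
    }
}

/// Validate a whole storage array, assumed to be one flat byte buffer.
/// Returns the number of UUIDs it holds. This checks the buffer length once
/// instead of validating every element individually.
pub fn validate_uuid_array(buffer: &[u8]) -> Result<usize, String> {
    if buffer.len() % UUID_BYTES == 0 {
        Ok(buffer.len() / UUID_BYTES)
    } else {
        Err(format!(
            "buffer length {} is not a multiple of {}",
            buffer.len(),
            UUID_BYTES
        ))
    }
}
```

For a type like this, array-level validation is a single length check, while a type with per-value invariants would need to inspect every element. That cost difference is one reason the two cases may deserve separate hooks.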

Notes
