Description
Vortex supports list types, but currently our write support is naive.
The current builtin writer will emit a single Flat layout for any list chunks, with a maximum size of 2MB (configurable, but fixed).
┌──────────────────────────────────────────────────────────────────────┐
│ │
│ │
│ ┌───────────────────────────────────────────List Array (memory)──┐ │
│ │ ┌───────┐ │ │
│ │offsets │ │ │ │
│ │ └───────┘ │ │
│ │ │ │
│ │ ┌──────────────────────────────────────┐ │ │
│ │elements │ │ │ │
│ │ └──────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │
│ │
│ │
│ │
│ ┌─────── Flat Layout ──────┐ │
│ ├───────────────┬──────────┤ │
│ │ offsets │ elements │ On-disk format │
│ ├───────────────┴──────────┤ │
│ │ │ │
│ │ │ │
│ │ │ │
│ │ elements │ │
│ │ │ │
│ │ │ │
│ │ │ │
│ │ │ │
│ └──────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────┘
For simple lists of primitives or strings, this is actually fine. Bulk access is IO-efficient, since the elements and their corresponding offsets are fetched in a single read. Random access will be on-par with access to normal primitive/string arrays.
However, there are notable cons, especially when dealing with nested Lists:
- Chunks are smaller as the 2MB budget is shared across all nested fields + offsets for the chunk
- No ability to pushdown projections through list elements at IO time. E.g. if you have
List<Struct{x,y,z}> there's no way to only read list.x, you need to read all elements
- No ability to short-circuit expression execution using just offsets or just validity, e.g.
select len(list_col) should just need to read offset but instead it forces a read of the whole elements child as well.
We're considering two different options to improve IO efficiency here:
- Adding a new explicit
ListLayout to our default writer, which shreds apart the list offsets, validity and elements children.
- Implementing sub-segment reads. This is a very pervasive change, but allows pushing down reads of portions of a segment at the IO layer for all Flat layouts, not just for lists.
There have been previous attempts to add (1), there are some notable challenges
- Neither component (offset or elements) is row-aligned, which means reporting splits is tricky
- To do this well, we probably need a
Map or Lambda expression to pull out an expression to push down over the elements. Along with other list-specific expressions/aggregates
Description
Vortex supports list types, but currently our write support is naive.
The current builtin writer will emit a single Flat layout for any list chunks, with a maximum size of 2MB (configurable, but fixed).
For simple lists of primitives or strings, this is actually fine. Bulk access is IO-efficient, since the elements and their corresponding offsets are fetched in a single read. Random access will be on-par with access to normal primitive/string arrays.
However, there are notable cons, especially when dealing with nested Lists:
List<Struct{x,y,z}>there's no way to only readlist.x, you need to read all elementsselect len(list_col)should just need to read offset but instead it forces a read of the whole elements child as well.We're considering two different options to improve IO efficiency here:
ListLayoutto our default writer, which shreds apart the listoffsets,validityandelementschildren.There have been previous attempts to add (1), there are some notable challenges
MaporLambdaexpression to pull out an expression to push down over the elements. Along with other list-specific expressions/aggregates