Add NormalizedVector extension type#7709
Conversation
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Merging this PR will not alter performance
Comparing Footnotes
|
1f5d4a2 to
766a8a0
Compare
30fd24f to
0d3cb51
Compare
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
|
| impl Matcher for AnyVector { | ||
| type Match<'a> = VectorMatcherMetadata; | ||
|
|
||
| fn try_match<'a>(ext_dtype: &'a ExtDTypeRef) -> Option<Self::Match<'a>> { | ||
| if !ext_dtype.is::<Vector>() { | ||
| // Walk to the inner `FixedSizeList` for whichever vector-shaped wrapper this is. Plain | ||
| // `Vector` stores the FSL directly; `NormalizedVector` wraps a `Vector` extension which | ||
| // in turn stores the FSL. | ||
| let (fsl_dtype, is_normalized) = if ext_dtype.is::<NormalizedVector>() { | ||
| let DType::Extension(inner) = ext_dtype.storage_dtype() else { | ||
| vortex_panic!( | ||
| "`NormalizedVector` storage must be `DType::Extension(Vector)`, got {}", | ||
| ext_dtype.storage_dtype(), | ||
| ) | ||
| }; | ||
|
|
||
| if !inner.is::<Vector>() { | ||
| vortex_panic!( | ||
| "`NormalizedVector` inner extension must be `Vector`, got {}", | ||
| inner.id(), | ||
| ) | ||
| } | ||
|
|
||
| (inner.storage_dtype(), true) | ||
| } else if ext_dtype.is::<Vector>() { | ||
| (ext_dtype.storage_dtype(), false) | ||
| } else { | ||
| return None; | ||
| } | ||
| }; | ||
|
|
||
| let DType::FixedSizeList(element_dtype, list_size, _) = ext_dtype.storage_dtype() else { | ||
| let DType::FixedSizeList(element_dtype, list_size, _) = fsl_dtype else { | ||
| vortex_panic!("`Vector` type somehow did not have a `FixedSizeList` storage type") |
There was a problem hiding this comment.
we cannot do this its SO expensive.
This could be called in a tight execute loop.
There was a problem hiding this comment.
You must at least impl a match
There was a problem hiding this comment.
I am confused by this, how else are we supposed to implement a matcher if we can't inspect the ext vtable?
Are you just saying that doing is::<AnyNormalizedVector> is faster? That is incredibly surprising to me and certainly not clear or documented anywhere clearly in our codebase.
Polar Signals Profiling ResultsLatest Run
Previous Runs (1)
Powered by Polar Signals Cloud |
Benchmarks: PolarSignals ProfilingVortex (geomean): 0.951x ➖ datafusion / vortex-file-compressed (0.951x ➖, 1↑ 0↓)
|
File Sizes: PolarSignals ProfilingNo file size changes detected. |
Benchmarks: FineWeb NVMeVerdict: No clear signal (low confidence) datafusion / vortex-file-compressed (0.991x ➖, 0↑ 1↓)
datafusion / vortex-compact (0.992x ➖, 0↑ 0↓)
datafusion / parquet (1.004x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.967x ➖, 2↑ 0↓)
duckdb / vortex-compact (0.992x ➖, 0↑ 0↓)
duckdb / parquet (0.983x ➖, 1↑ 0↓)
Full attributed analysis
|
File Sizes: FineWeb NVMeNo file size changes detected. |
Benchmarks: TPC-H SF=1 on NVMEVerdict: No clear signal (low confidence) datafusion / vortex-file-compressed (0.980x ➖, 0↑ 0↓)
datafusion / vortex-compact (0.990x ➖, 0↑ 0↓)
datafusion / parquet (1.001x ➖, 0↑ 2↓)
datafusion / arrow (0.976x ➖, 2↑ 0↓)
duckdb / vortex-file-compressed (0.994x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.999x ➖, 0↑ 0↓)
duckdb / parquet (0.987x ➖, 1↑ 1↓)
duckdb / duckdb (0.995x ➖, 0↑ 0↓)
Full attributed analysis
|
File Sizes: TPC-H SF=1 on NVMEFile Size Changes (195 files changed, -98.4% overall, 0↑ 195↓)
Totals:
|
Benchmarks: FineWeb S3Verdict: No clear signal (low confidence) datafusion / vortex-file-compressed (1.141x ➖, 0↑ 1↓)
datafusion / vortex-compact (0.957x ➖, 0↑ 0↓)
datafusion / parquet (0.935x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.017x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.015x ➖, 0↑ 0↓)
duckdb / parquet (0.949x ➖, 0↑ 0↓)
Full attributed analysis
|
Benchmarks: Statistical and Population GeneticsVerdict: No clear signal (low confidence) duckdb / vortex-file-compressed (0.954x ➖, 1↑ 0↓)
duckdb / vortex-compact (0.989x ➖, 0↑ 0↓)
duckdb / parquet (0.980x ➖, 0↑ 0↓)
Full attributed analysis
|
File Sizes: Statistical and Population GeneticsNo file size changes detected. |
Benchmarks: TPC-H SF=10 on NVMEVerdict: No clear signal (low confidence) datafusion / vortex-file-compressed (1.034x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.027x ➖, 0↑ 0↓)
datafusion / parquet (1.035x ➖, 0↑ 0↓)
datafusion / arrow (1.047x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.023x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.022x ➖, 0↑ 0↓)
duckdb / parquet (1.013x ➖, 0↑ 0↓)
duckdb / duckdb (1.009x ➖, 0↑ 0↓)
Full attributed analysis
|
File Sizes: TPC-H SF=10 on NVMENo file size changes detected. |
Benchmarks: Clickbench on NVMEVerdict: No clear signal (low confidence) datafusion / vortex-file-compressed (1.016x ➖, 0↑ 0↓)
datafusion / parquet (1.024x ➖, 0↑ 1↓)
duckdb / vortex-file-compressed (0.998x ➖, 7↑ 3↓)
duckdb / parquet (1.010x ➖, 0↑ 1↓)
duckdb / duckdb (1.011x ➖, 1↑ 2↓)
Full attributed analysis
|
File Sizes: Clickbench on NVMEFile Size Changes (1 files changed, -0.0% overall, 0↑ 1↓)
Totals:
|
Benchmarks: TPC-H SF=1 on S3Verdict: No clear signal (environment too noisy confidence) datafusion / vortex-file-compressed (0.932x ➖, 1↑ 0↓)
datafusion / vortex-compact (0.980x ➖, 1↑ 3↓)
datafusion / parquet (0.959x ➖, 0↑ 1↓)
duckdb / vortex-file-compressed (0.995x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.989x ➖, 0↑ 0↓)
duckdb / parquet (1.018x ➖, 0↑ 0↓)
Full attributed analysis
|
Benchmarks: Random AccessVortex (geomean): 1.238x ❌ unknown / unknown (1.204x ❌, 1↑ 50↓)
|
Benchmarks: CompressionVortex (geomean): 1.013x ➖ unknown / unknown (1.007x ➖, 2↑ 8↓)
|
Benchmarks: TPC-H SF=10 on S3Verdict: No clear signal (environment too noisy confidence) datafusion / vortex-file-compressed (0.980x ➖, 0↑ 0↓)
datafusion / vortex-compact (0.974x ➖, 1↑ 0↓)
datafusion / parquet (0.939x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.032x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.968x ➖, 0↑ 0↓)
duckdb / parquet (1.008x ➖, 0↑ 0↓)
Full attributed analysis
|
Benchmarks: TPC-DS SF=1 on NVMEVerdict: No clear signal (low confidence) datafusion / vortex-file-compressed (1.066x ➖, 1↑ 22↓)
datafusion / vortex-compact (1.062x ➖, 0↑ 13↓)
datafusion / parquet (1.055x ➖, 0↑ 13↓)
duckdb / vortex-file-compressed (1.053x ➖, 0↑ 13↓)
duckdb / vortex-compact (1.045x ➖, 0↑ 15↓)
duckdb / parquet (1.035x ➖, 0↑ 6↓)
duckdb / duckdb (1.062x ➖, 0↑ 23↓)
Full attributed analysis
|
File Sizes: TPC-DS SF=1 on NVMENo file size changes detected. |
|
@joseph-isaacs this branch doesn't affect any of our current benchmarks (it doesn't go through this codepath) |
Summary
Tracking issue: #7297
Adds a
NormalizedVectorextension type tovortex-tensor, which wraps aVectorextension array. Since we may want to add refinement types in the future, we chose to wrap aVectorstorage array instead of aFixedSizeListstorage array.Additionally does a bunch of general cleanup that was obviously bad when I was actually trying to add this (and I wasn't really able to pull it out into a separate PR because it is only obvious when working with this new type).
There are still a bunch of TODOs that I'd like to fix but I'm probably going to do those all at once in a later PR.
Testing
Adds basic tests for the new
NormalizedVectortype and also adds some more complex tests for logic that was missing before for the older code.