Late materialization support for duckdb #7721
Merging this PR will degrade performance by 33.12%
| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ❌ | WallTime | mix[0%_in/100%_out] | 227.3 µs | 284.7 µs | -20.14% |
| ❌ | WallTime | dynamic_dispatch_u32[10M] | 109.6 µs | 163.8 µs | -33.12% |
| ❌ | Simulation | bitwise_not_vortex_buffer_mut[128] | 246.1 ns | 275.3 ns | -10.6% |
Comparing myrrc/duckdb-row-id-columns (66fd178) with develop (70eee88)
081850e to a98946d
The point of disagreement with @joseph-isaacs was whether to propagate the partition range/selection to the scan reader. Unfortunately, there is no other way to do it right now: the partition stream is non-deterministic, and we can't filter on the stream itself because of file-level pruning. Iterating the stream twice is not an option either, because each iteration does significant work (opening files), and since we don't use cached footers (we push them to the cache but never read them) that is costly as well. We've agreed to merge this change, after which I'll work on adding stable partition IDs so that partition filtering and, consequently, row-range filtering can happen outside the Scan request.
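To make the trade-off above concrete, here is a minimal sketch of a scan request that carries per-file row filters keyed by partition index, so the reader can apply them no matter what order partitions arrive in. The names `RowFilter`, `ScanRequest`, `file_filters` and `rows_for` are hypothetical and are not taken from this PR; skipping partitions that have no filter is also an assumption made for the example.

```rust
use std::collections::HashMap;
use std::ops::Range;

/// Hypothetical per-file row filter (illustrative names, not the PR's types).
#[derive(Debug, Clone, PartialEq)]
pub enum RowFilter {
    /// Keep rows whose file-local index falls in this half-open range.
    Range(Range<u64>),
    /// Keep exactly these file-local row indices.
    Selection(Vec<u64>),
}

/// Hypothetical scan request. Filters are keyed by partition index, which is
/// assumed to identify a file independently of its arrival order in the stream.
#[derive(Debug, Default)]
pub struct ScanRequest {
    pub file_filters: HashMap<usize, RowFilter>,
}

impl ScanRequest {
    /// Returns the file-local rows to read for one partition, or `None` when
    /// the request holds no filter for it (assumed here to mean "skip").
    pub fn rows_for(&self, partition_idx: usize, row_count: u64) -> Option<Vec<u64>> {
        let filter = self.file_filters.get(&partition_idx)?;
        let rows: Vec<u64> = match filter {
            // Clamp the range to the rows that actually exist in the file.
            RowFilter::Range(r) => (r.start..r.end.min(row_count)).collect(),
            // Drop selected indices that point past the end of the file.
            RowFilter::Selection(sel) => {
                sel.iter().copied().filter(|&i| i < row_count).collect()
            }
        };
        Some(rows)
    }
}
```

Because the lookup is by partition index rather than by position in the stream, the reader never needs a second pass over the stream to decide which rows to keep.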
Benchmarks: PolarSignals Profiling
Vortex (geomean): 1.050x ➖
datafusion / vortex-file-compressed (1.050x ➖, 0↑ 4↓)
File Sizes: PolarSignals Profiling
No file size changes detected.
Benchmarks: FineWeb NVMe
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.983x ➖, 1↑ 1↓)
datafusion / vortex-compact (0.908x ➖, 2↑ 0↓)
datafusion / parquet (0.909x ➖, 5↑ 0↓)
duckdb / vortex-file-compressed (0.900x ✅, 4↑ 0↓)
duckdb / vortex-compact (0.951x ➖, 0↑ 0↓)
duckdb / parquet (0.938x ➖, 0↑ 0↓)
File Sizes: FineWeb NVMe
No file size changes detected.
Benchmarks: TPC-H SF=1 on NVME
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (0.970x ➖, 0↑ 0↓)
datafusion / vortex-compact (0.991x ➖, 0↑ 0↓)
datafusion / parquet (0.994x ➖, 0↑ 1↓)
datafusion / arrow (0.963x ➖, 2↑ 0↓)
duckdb / vortex-file-compressed (0.975x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.971x ➖, 0↑ 0↓)
duckdb / parquet (0.999x ➖, 0↑ 1↓)
duckdb / duckdb (0.947x ➖, 3↑ 0↓)
File Sizes: TPC-H SF=1 on NVME
No file size changes detected.
Benchmarks: TPC-DS SF=1 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.899x ✅, 50↑ 0↓)
datafusion / vortex-compact (0.919x ➖, 32↑ 0↓)
datafusion / parquet (0.913x ➖, 37↑ 0↓)
duckdb / vortex-file-compressed (0.925x ➖, 33↑ 0↓)
duckdb / vortex-compact (0.926x ➖, 25↑ 0↓)
duckdb / parquet (0.933x ➖, 20↑ 0↓)
duckdb / duckdb (0.898x ✅, 48↑ 0↓)
File Sizes: TPC-DS SF=1 on NVME
No file size changes detected.
Benchmarks: FineWeb S3
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.054x ➖, 0↑ 1↓)
datafusion / vortex-compact (1.048x ➖, 0↑ 0↓)
datafusion / parquet (1.072x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.971x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.999x ➖, 0↑ 0↓)
duckdb / parquet (1.001x ➖, 0↑ 0↓)
Benchmarks: Random Access
Vortex (geomean): 0.928x ➖
unknown / unknown (0.969x ➖, 6↑ 1↓)
Benchmarks: TPC-H SF=10 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.977x ➖, 0↑ 0↓)
datafusion / vortex-compact (0.980x ➖, 0↑ 0↓)
datafusion / parquet (0.974x ➖, 0↑ 0↓)
datafusion / arrow (0.975x ➖, 1↑ 0↓)
duckdb / vortex-file-compressed (0.990x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.993x ➖, 0↑ 0↓)
duckdb / parquet (0.992x ➖, 0↑ 0↓)
duckdb / duckdb (0.996x ➖, 0↑ 0↓)
File Sizes: TPC-H SF=10 on NVME
No file size changes detected.
Benchmarks: Statistical and Population Genetics
Verdict: No clear signal (low confidence)
duckdb / vortex-file-compressed (1.009x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.039x ➖, 0↑ 1↓)
duckdb / parquet (1.011x ➖, 0↑ 0↓)
File Sizes: Statistical and Population Genetics
No file size changes detected.
Benchmarks: Clickbench on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.955x ➖, 1↑ 1↓)
datafusion / parquet (0.984x ➖, 2↑ 0↓)
duckdb / vortex-file-compressed (0.990x ➖, 1↑ 0↓)
duckdb / parquet (1.011x ➖, 0↑ 2↓)
duckdb / duckdb (0.978x ➖, 3↑ 0↓)
File Sizes: Clickbench on NVME
File Size Changes (1 files changed, -0.0% overall, 0↑ 1↓)
Benchmarks: TPC-H SF=1 on S3
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (0.976x ➖, 0↑ 0↓)
datafusion / vortex-compact (0.994x ➖, 1↑ 1↓)
datafusion / parquet (1.087x ➖, 0↑ 2↓)
duckdb / vortex-file-compressed (1.024x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.091x ➖, 0↑ 1↓)
duckdb / parquet (1.009x ➖, 0↑ 0↓)
Benchmarks: Compression
Vortex (geomean): 0.990x ➖
unknown / unknown (0.997x ➖, 6↑ 4↓)
Benchmarks: TPC-H SF=10 on S3
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (0.988x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.046x ➖, 0↑ 1↓)
datafusion / parquet (0.934x ➖, 0↑ 1↓)
duckdb / vortex-file-compressed (0.999x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.960x ➖, 0↑ 0↓)
duckdb / parquet (1.004x ➖, 0↑ 0↓)
a98946d to 5679f17
5679f17 to e1fcdd9
e1fcdd9 to 66fd178
// row_idx will be rearranged to correct position in scan(), prepend
// here
let row_idx = cast(row_idx(), DType::Primitive(PType::I64, false.into()));
let row_idx_struct = pack([("file_row_number", row_idx)], false.into());
What happens if a file already contains a column named file_row_number?
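One possible way to handle the collision raised in this comment is to reject it up front, before the virtual column is packed in. The sketch below is standalone Rust over plain column names; `check_virtual_column_collision` is a hypothetical helper, and nothing here states how the PR actually resolves the case (rejecting, renaming, or shadowing would all be options).

```rust
/// Name of the virtual column, as it appears in the snippet under review.
pub const FILE_ROW_NUMBER: &str = "file_row_number";

/// Hypothetical guard: returns an error if the file's own schema already
/// uses the virtual column name, so the two cannot be confused downstream.
pub fn check_virtual_column_collision(file_columns: &[&str]) -> Result<(), String> {
    if file_columns.iter().any(|name| *name == FILE_ROW_NUMBER) {
        Err(format!(
            "file already has a column named `{}`",
            FILE_ROW_NUMBER
        ))
    } else {
        Ok(())
    }
}
```

Failing loudly is the simplest choice for a sketch; a real fix might instead prefer the file's own column or expose the virtual one under a different name.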
Add file_index, file_row_number virtual columns.
Add file-based filtering (range, selection) to ScanRequest.
Add partition index method.
Add late materialization support and row id columns support in duckdb.
Attempt 1 was accidentally merged in #7631 and then reverted.
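As an illustration of how the file_index and file_row_number columns could back a row id for late materialization, the sketch below packs both into one integer and later groups fetched row ids into per-file selections. The bit layout (23 bits of file index, 40 bits of row number, sign bit left clear), the constants, and all function names are assumptions made for this example and are not taken from the PR.

```rust
use std::collections::BTreeMap;

/// Assumed layout: low 40 bits hold the row number within the file,
/// the next 23 bits hold the file index, and the top bit stays clear
/// so the value also fits in a non-negative i64.
const ROW_BITS: u32 = 40;
const FILE_BITS: u32 = 23;
const ROW_MASK: u64 = (1u64 << ROW_BITS) - 1;

/// Packs (file_index, file_row_number) into one id, or `None` on overflow.
pub fn encode_row_id(file_index: u64, file_row_number: u64) -> Option<u64> {
    if file_index >= (1u64 << FILE_BITS) || file_row_number > ROW_MASK {
        return None;
    }
    Some((file_index << ROW_BITS) | file_row_number)
}

/// Splits a packed id back into (file_index, file_row_number).
pub fn decode_row_id(row_id: u64) -> (u64, u64) {
    (row_id >> ROW_BITS, row_id & ROW_MASK)
}

/// Groups row ids by file and returns a sorted, deduplicated selection of
/// file-local row numbers for each file that has to be read.
pub fn group_by_file(row_ids: &[u64]) -> BTreeMap<u64, Vec<u64>> {
    let mut per_file: BTreeMap<u64, Vec<u64>> = BTreeMap::new();
    for &id in row_ids {
        let (file, row) = decode_row_id(id);
        per_file.entry(file).or_default().push(row);
    }
    for rows in per_file.values_mut() {
        rows.sort_unstable();
        rows.dedup();
    }
    per_file
}
```

The per-file selections produced by `group_by_file` are the kind of input a file-based selection filter on a scan request would consume; files that do not appear in the map need not be opened at all.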