Late materialization support for duckdb #7631
Benchmarks: PolarSignals Profiling
Vortex (geomean): 1.132x ❌
datafusion / vortex-file-compressed (1.132x ❌, 0↑ 7↓)
File Sizes: PolarSignals Profiling
No file size changes detected.
Benchmarks: FineWeb NVMe
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.058x ➖, 0↑ 3↓)
datafusion / vortex-compact (1.022x ➖, 0↑ 1↓)
datafusion / parquet (1.065x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.091x ➖, 0↑ 3↓)
duckdb / vortex-compact (1.023x ➖, 0↑ 1↓)
duckdb / parquet (1.047x ➖, 0↑ 1↓)
File Sizes: FineWeb NVMe
No file size changes detected.
Benchmarks: TPC-H SF=1 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.008x ➖, 0↑ 0↓)
datafusion / vortex-compact (0.994x ➖, 0↑ 0↓)
datafusion / parquet (1.008x ➖, 0↑ 1↓)
datafusion / arrow (0.994x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.003x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.009x ➖, 0↑ 0↓)
duckdb / parquet (1.021x ➖, 0↑ 2↓)
duckdb / duckdb (1.002x ➖, 0↑ 0↓)
File Sizes: TPC-H SF=1 on NVME
No file size changes detected.
Benchmarks: TPC-DS SF=1 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.084x ➖, 0↑ 30↓)
datafusion / vortex-compact (1.067x ➖, 0↑ 21↓)
datafusion / parquet (1.069x ➖, 0↑ 25↓)
duckdb / vortex-file-compressed (1.071x ➖, 0↑ 26↓)
duckdb / vortex-compact (1.052x ➖, 0↑ 12↓)
duckdb / parquet (1.048x ➖, 0↑ 13↓)
duckdb / duckdb (1.054x ➖, 1↑ 17↓)
File Sizes: TPC-DS SF=1 on NVME
No file size changes detected.
Benchmarks: FineWeb S3
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (1.172x ➖, 0↑ 2↓)
datafusion / vortex-compact (0.710x ➖, 4↑ 0↓)
datafusion / parquet (1.143x ➖, 0↑ 1↓)
duckdb / vortex-file-compressed (1.047x ➖, 0↑ 1↓)
duckdb / vortex-compact (1.053x ➖, 0↑ 1↓)
duckdb / parquet (1.004x ➖, 0↑ 0↓)
Benchmarks: Clickbench on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.006x ➖, 0↑ 0↓)
datafusion / parquet (1.010x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.940x ➖, 8↑ 2↓)
duckdb / parquet (0.996x ➖, 0↑ 0↓)
duckdb / duckdb (0.995x ➖, 1↑ 0↓)
Benchmarks: TPC-H SF=10 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.021x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.015x ➖, 0↑ 0↓)
datafusion / parquet (1.015x ➖, 0↑ 0↓)
datafusion / arrow (1.023x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.012x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.011x ➖, 0↑ 0↓)
duckdb / parquet (1.007x ➖, 0↑ 0↓)
duckdb / duckdb (1.003x ➖, 0↑ 0↓)
File Sizes: TPC-H SF=10 on NVME
No file size changes detected.
Benchmarks: Statistical and Population Genetics
Verdict: No clear signal (low confidence)
duckdb / vortex-file-compressed (0.971x ➖, 1↑ 0↓)
duckdb / vortex-compact (0.995x ➖, 0↑ 0↓)
duckdb / parquet (0.988x ➖, 0↑ 0↓)
File Sizes: Statistical and Population Genetics
No file size changes detected.
Benchmarks: TPC-H SF=1 on S3
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (0.978x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.063x ➖, 0↑ 2↓)
datafusion / parquet (0.989x ➖, 1↑ 0↓)
duckdb / vortex-file-compressed (0.917x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.949x ➖, 0↑ 0↓)
duckdb / parquet (0.958x ➖, 0↑ 0↓)
Benchmarks: TPC-H SF=10 on S3
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (1.010x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.090x ➖, 0↑ 1↓)
datafusion / parquet (1.010x ➖, 0↑ 1↓)
duckdb / vortex-file-compressed (1.032x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.016x ➖, 0↑ 0↓)
duckdb / parquet (1.047x ➖, 0↑ 0↓)
Benchmarks: Random Access
Vortex (geomean): 0.933x ➖
unknown / unknown (0.959x ➖, 5↑ 0↓)
Benchmarks: Compression
Vortex (geomean): 1.006x ➖
unknown / unknown (0.994x ➖, 3↑ 3↓)
a3e6376 to c54684f
fn intersect_sorted(left: &[u64], right: &[u64]) -> Vec<u64> {
    let mut result = Vec::new();
    let (mut i, mut j) = (0, 0);
    while i < left.len() && j < right.len() {
        match left[i].cmp(&right[j]) {
            std::cmp::Ordering::Equal => {
                result.push(left[i]);
                i += 1;
                j += 1;
            }
            std::cmp::Ordering::Less => i += 1,
            std::cmp::Ordering::Greater => j += 1,
        }
    }
this should be in a method Selection::merge
Selection::merge is a more generic (and hard) method to implement. Here we're sure we're handling either Selection::All or Selection::IncludeByIndex.
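The two-case merge described in this reply can be sketched in isolation. This is a minimal illustration, not the PR's code: `Sel` is a hypothetical stand-in with only the two variants mentioned above, `and_sel` is an invented helper name, and the index lists are assumed to be ascending and duplicate-free.

```rust
/// Hypothetical stand-in for the selection type; only the two variants
/// mentioned in the thread are modelled here.
#[derive(Debug, Clone, PartialEq)]
enum Sel {
    All,
    IncludeByIndex(Vec<u64>),
}

/// Intersect two ascending, duplicate-free index lists with a two-pointer walk.
fn intersect_sorted(left: &[u64], right: &[u64]) -> Vec<u64> {
    let mut result = Vec::with_capacity(left.len().min(right.len()));
    let (mut i, mut j) = (0, 0);
    while i < left.len() && j < right.len() {
        match left[i].cmp(&right[j]) {
            std::cmp::Ordering::Equal => {
                result.push(left[i]);
                i += 1;
                j += 1;
            }
            std::cmp::Ordering::Less => i += 1,
            std::cmp::Ordering::Greater => j += 1,
        }
    }
    result
}

/// AND two selections when only `All` and `IncludeByIndex` can occur.
/// `All` is the identity element, so it simply yields the other side.
fn and_sel(a: Sel, b: Sel) -> Sel {
    match (a, b) {
        (Sel::All, other) | (other, Sel::All) => other,
        (Sel::IncludeByIndex(x), Sel::IncludeByIndex(y)) => {
            Sel::IncludeByIndex(intersect_sorted(&x, &y))
        }
    }
}
```

A general merge would also have to handle exclusion lists and other variants, which is the extra work the reply refers to.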
for child in conj.children() {
    let (sel, range) = try_from_virtual_column_filter(child)?;
    if let Selection::IncludeByIndex(buf) = sel {
        indices = Some(match indices {
            None => buf.iter().copied().collect(),
            Some(existing) => intersect_sorted(&existing, buf.as_ref()),
        });
    }
    if let Some(r) = range {
        start = start.max(r.start);
        end = end.min(r.end);
    }
}
let range = (start < end).then_some(start..end);
let sel = indices
    .map(|v| Selection::IncludeByIndex(Buffer::from_iter(v)))
    .unwrap_or(Selection::All);
Ok((sel, range))
this is just selection merge?
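The range half of that fold is an intersection of half-open ranges: take the largest start and the smallest end. In the excerpt, a result with `start >= end` collapses to `None`; how the caller interprets `None` is not visible here, so the sketch below keeps "no range filter" and "contradictory range filters" as separate outcomes. `RangeConstraint` and `and_ranges` are hypothetical names used only for illustration.

```rust
use std::ops::Range;

/// Outcome of AND-ing several half-open row-range filters (hypothetical type).
#[derive(Debug, PartialEq)]
enum RangeConstraint {
    /// No range filter was present at all.
    Unrestricted,
    /// All filters overlap in this non-empty range.
    Within(Range<u64>),
    /// The filters contradict each other; no row can match.
    Empty,
}

/// Intersect half-open ranges by taking the max of starts and min of ends.
fn and_ranges(ranges: &[Range<u64>]) -> RangeConstraint {
    if ranges.is_empty() {
        return RangeConstraint::Unrestricted;
    }
    let mut start = 0u64;
    let mut end = u64::MAX;
    for r in ranges {
        start = start.max(r.start);
        end = end.min(r.end);
    }
    if start < end {
        RangeConstraint::Within(start..end)
    } else {
        RangeConstraint::Empty
    }
}
```

For example, `0..10` AND `5..20` gives `5..10`, while `0..5` AND `10..20` gives the empty outcome rather than an unrestricted scan.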
7d2dba4 to 081850e
/// row range.
pub selection: Selection,
/// If we're operating on files, what files to read
pub file_selection: Selection,
This only makes sense on a specific type of scan
I agree, but we need to filter out lazy non-materialized files. Is there another way?
/// If we're operating on files, what files to read
pub file_selection: Selection,
/// If we're operating on files, what files to read
pub file_range: Option<Range<u64>>,
For duckdb we do, because it emits file ranges. Yes, you can turn a range into a selection, and it would probably be fast, but I think an extra parameter isn't that bad.
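The trade-off in this reply can be made concrete with a small sketch. The helper names below are hypothetical and not part of the PR. Expanding a range into an index list costs one entry per file, while the range itself stays two integers with a constant-time membership check.

```rust
use std::ops::Range;

/// Expand a half-open file range into an explicit, ascending index list.
/// Storage grows linearly with the number of files in the range.
fn range_to_indices(range: Range<u64>) -> Vec<u64> {
    range.collect()
}

/// Membership test against the range form: two comparisons.
fn in_range(range: &Range<u64>, file_idx: u64) -> bool {
    range.contains(&file_idx)
}

/// Membership test against the index-list form: binary search over a
/// sorted slice.
fn in_indices(sorted: &[u64], file_idx: u64) -> bool {
    sorted.binary_search(&file_idx).is_ok()
}
```

Both forms answer the same question, so the choice is about whether the conversion cost and the extra field are worth it for a caller that already produces ranges.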
Main impact should be clickbench Q23 stabilisation. That's likely noise, but local runs showed a speedup for it as well.
Add file_index, file_row_number virtual columns.
Add file-based filtering (range, selection) to ScanRequest.
Add partition index method.
Add late materialization support and row id columns support in duckdb.

Signed-off-by: Mikhail Kot <to@myrrc.dev>
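As a rough illustration of how the two virtual columns can serve as a row id for late materialization, here is a toy in-memory sketch. It is an assumption-laden model, not the PR's implementation: `RowId`, `ToyFile`, `matching_row_ids` and `materialize` are hypothetical names, and real scans operate on files rather than vectors.

```rust
/// A row id built from the two virtual columns named in the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct RowId {
    file_index: u64,
    file_row_number: u64,
}

/// Toy "file" with a narrow filter column and a wide payload column
/// (hypothetical layout, for illustration only).
struct ToyFile {
    key: Vec<i64>,
    payload: Vec<String>,
}

/// Pass 1: read only the narrow column and emit ids of matching rows.
fn matching_row_ids(files: &[ToyFile], pred: impl Fn(i64) -> bool) -> Vec<RowId> {
    let mut ids = Vec::new();
    for (f, file) in files.iter().enumerate() {
        for (r, &k) in file.key.iter().enumerate() {
            if pred(k) {
                ids.push(RowId { file_index: f as u64, file_row_number: r as u64 });
            }
        }
    }
    ids
}

/// Pass 2: fetch the wide column only for the surviving row ids.
fn materialize<'a>(files: &'a [ToyFile], ids: &[RowId]) -> Vec<&'a str> {
    ids.iter()
        .map(|id| files[id.file_index as usize].payload[id.file_row_number as usize].as_str())
        .collect()
}
```

The ids from the first pass are what a file selection plus a per-file row selection would encode in the second pass: files with no surviving ids never need to be opened for the wide columns.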