Add is_null/is_not_true trait methods to PhysicalExpr #21623
Add is_null/is_not_true trait methods to PhysicalExpr #21623SubhamSinghal wants to merge 14 commits intoapache:mainfrom
Conversation
|
@2010YOUY01 @asolimando I am adding |
|
@adriangb can you help in reviewing this PR |
|
Thanks for the PR! Personally I would like to review:
|
|
Removed EliminateOuterJoin related code changes from this PR. Will create follow-up PR for it. |
adriangb
left a comment
There was a problem hiding this comment.
This looks good! My main concern is about the representation of Option<bool> vs. an enum.
| /// `null_columns` contains the column indices (in the input schema) that | ||
| /// are assumed to be `NULL`. | ||
| /// | ||
| /// - `Some(true)`: definitely evaluates to `NULL` or `FALSE` |
There was a problem hiding this comment.
| /// - `Some(true)`: definitely evaluates to `NULL` or `FALSE` | |
| /// - `Some(true)`: some rows may evaluate `NULL` or `FALSE` |
I assume the point is not that this expression is always NULL/FALSE for any input, rather that it may sometimes be NULL/FALSE?
If this distinction may be useful, should we make an enum?
enum IsFalsy {
/// Falsy (`false` or `null`) for any input
Always,
/// May be `false` or `null` for some inputs but not others.
/// This is the conservative default.
Sometimes,
/// Never falsy
Never,
}These are the same 3 states but more explicitly spelled out IMO.
There was a problem hiding this comment.
None return value denote Sometimes enum but having an enum make our intent more explicit. I will refactor code.
|
Starting another round of review sorry for the delay |
adriangb
left a comment
There was a problem hiding this comment.
Thanks for working on this! This is step in the right direction, but I have some concerns about the current API shape and a few concrete bugs.
Concrete bugs
1. Doc comment for is_not_true has the variants swapped
/// - [`IsFalsy::Always`]: some rows may evaluate `NULL` or `FALSE`
/// - [`IsFalsy::Never`]: definitely does NOT evaluate to `NULL` or `FALSE`
Always should be "definitely evaluates to NULL or FALSE for all rows." The text "some rows may" describes Sometimes. The is_null doc is correct; this one was likely copy-paste-edited.
2. Never is overloaded as a default for "no information"
Many implementations return Never as a fall-through rather than as a proven negative:
Column::is_nullreturnsNeverwhen the column isn't innull_columns(we don't actually know it's non-null, especially if we are going to be using this for predicate pruning in parquet).Column::is_not_truereturnsNeverfor any column outsidenull_columns.Literal::is_not_truereturnsNeverfor any non-null non-boolean literal.
This conflates "we proved it's not the case" with "we have no information," collapsing the tri-state into effectively a bi-state. Callers can't trust Never to mean "proven."
3. NotExpr::is_not_true misses the most useful case (acknowledged in tests)
NOT(IS NULL(a)) with a null evaluates to NOT(TRUE) = FALSE, which is not-true — but the implementation returns Never. The test not_is_null_does_reject is named to suggest rejection but asserts IsFalsy::Never. This is a regression vs. what a hand-rolled walker can detect.
Bigger API concern
The semantics implicitly assume "columns NOT in null_columns are non-null." This isn't documented but is load-bearing — IsNotNullExpr::is_not_true, Column::is_null, etc. only return correct answers under this convention.
That convention works for the EliminateOuterJoin use case (where one side of the join is fully nulled at once), but for parquet pruning there is no such guarantee — a column not in any "known-null" might just be missing stats, etc.
I'd suggest either:
- (a) explicitly scoping the doc and renaming methods to make the null-rejection convention clear, or
- (b) switching to the conservative semantics where columns outside
null_columnshave unknown nullness.
I think I worked through the cases and the conservative version still gives correct (and equally optimizable) answers for the EliminateOuterJoin integration, since that caller always passes a whole side at once. So conservative is strictly safer.
Suggestion: a 4-state truth lattice
The deeper issue is that IsFalsy (a 3-state property indicator) makes some operators harder to reason about than they need to be. Several precision losses (NULL AND TRUE = NULL not detected, NOT(IS NULL) rejection not detected) stem from not being able to express "this is definitely TRUE" vs. "this is definitely FALSE."
A cleaner shape:
enum Truth {
AlwaysTrue,
AlwaysFalse,
AlwaysNull,
Unknown,
}
enum Nullness {
AlwaysNull,
NeverNull,
Unknown,
}
trait PhysicalExpr {
/// Defined for all expressions.
fn nullness(&self, null_cols: &HashSet<usize>) -> Nullness;
/// Defined for boolean expressions only.
fn truthiness(&self, null_cols: &HashSet<usize>) -> Truth;
}Boolean ops become straight truth-tables over Truth (e.g., NULL AND FALSE = FALSE is one match arm), NOT is a clean flip, CASE can prune branches when the WHEN is determined, etc. Non-boolean expressions only expose nullness — WHERE int_expr is a type error in DataFusion, so we never need to assign truthiness to non-booleans.
The two consumer use cases collapse to:
- Null-rejection:
truthiness == AlwaysFalse || truthiness == AlwaysNull. - Pruning: same check.
This also makes the parquet-pruning reuse straightforward under the conservative-semantics fix above.
Example operator rules (not exhaustive, AI generated):
Comparisons (=, <, …) — NULL-propagating, ask nullness of children:
- If either child has
nullness = AlwaysNull→AlwaysNull. - Else →
Unknown(we don't track values).
AND truth table:
| ∧ | AT | AF | AN | U |
|---|---|---|---|---|
| AT | AT | AF | AN | U |
| AF | AF | AF | AF | AF |
| AN | AN | AF | AN | U |
| U | U | AF | U | U |
OR truth table:
| ∨ | AT | AF | AN | U |
|---|---|---|---|---|
| AT | AT | AT | AT | AT |
| AF | AT | AF | AN | U |
| AN | AT | AN | AN | U |
| U | AT | U | U | U |
NOT: AT ↔ AF, AN → AN, U → U.
IS NULL(x) / IS NOT NULL(x): derived from nullness(x); never returns AlwaysNull.
CASE WHEN c THEN t ELSE e END:
truthiness(c) = AlwaysTrue→ result =truthiness(t)truthiness(c) = AlwaysFalse | AlwaysNull→ result =truthiness(e)truthiness(c) = Unknown→ lattice-join oftruthiness(t)andtruthiness(e)
Worked example: a = 1 OR b IN (1, 2, 3)
With null_cols = {a, b}:
a = 1: nullness(a)=AlwaysNull → truthiness = AlwaysNull
b IN (1, 2, 3): nullness(b)=AlwaysNull → truthiness = AlwaysNull
OR(AlwaysNull, AlwaysNull) = AlwaysNull
Top-level AlwaysNull → predicate rejects → outer→inner. ✓
With null_cols = {a} only:
a = 1: AlwaysNull
b IN (1, 2, 3): nullness(b)=Unknown → truthiness = Unknown
OR(AlwaysNull, Unknown) = Unknown
Don't reject. ✓ (If b is 1, predicate is TRUE.)
Which issue does this PR close?
Rationale for this change
Today,
EliminateOuterJoinusesextract_non_nullable_columns()with explicit pattern matching for eachExprvariant to determine if a WHERE predicate rejects NULL rows. Every new expression type (LIKE, IS TRUE, etc.) must be added manually to this function. This is brittle and doesn't scale.This PR introduces
is_nullandis_not_truetrait methods onPhysicalExpr, so each expression type defines its own null-rejection behavior. This also enables future use in parquet pruning at the physical layer.What changes are included in this PR?
New trait methods on
PhysicalExpr:is_null(null_columns) -> Option<bool>: returns whether the expression is guaranteed to evaluate to NULL when the given columns are NULLis_not_true(null_columns) -> Option<bool>: returns whether the expression is guaranteed to evaluate to NULL or FALSE when the given columns are NULLAre these changes tested?
Yes, with UT
Are there any user-facing changes?
No