Add is_null/is_not_true trait methods to PhysicalExpr by SubhamSinghal · Pull Request #21623 · apache/datafusion

SubhamSinghal · 2026-04-14T15:09:04Z

Which issue does this PR close?

Closes
related to Eliminate more outer joins by supporting more expressions #13232

Rationale for this change

Today, EliminateOuterJoin uses extract_non_nullable_columns() with explicit pattern matching for each Expr variant to determine if a WHERE predicate rejects NULL rows. Every new expression type (LIKE, IS TRUE, etc.) must be added manually to this function. This is brittle and doesn't scale.

This PR introduces is_null and is_not_true trait methods on PhysicalExpr, so each expression type defines its own null-rejection behavior. This also enables future use in parquet pruning at the physical layer.

What changes are included in this PR?

New trait methods on PhysicalExpr:

is_null(null_columns) -> Option<bool>: returns whether the expression is guaranteed to evaluate to NULL when the given columns are NULL
is_not_true(null_columns) -> Option<bool>: returns whether the expression is guaranteed to evaluate to NULL or FALSE when the given columns are NULL

Are these changes tested?

Yes, with UT

Are there any user-facing changes?

No

SubhamSinghal · 2026-04-16T15:00:21Z

@2010YOUY01 @asolimando I am adding is_not_true function in physical expressions as suggested here. Please have a look. Thanks.

SubhamSinghal · 2026-04-23T18:19:41Z

@adriangb can you help in reviewing this PR

adriangb · 2026-04-23T18:26:46Z

Thanks for the PR!

Personally I would like to review:

A PR that adds the method to PhysicalExpr and all of the implementations, with unit tests for each one (especially the complex ones like CaseExpr or BinaryExpr)
A followup PR that uses these methods for EliminateOuterJoin

SubhamSinghal · 2026-04-24T05:22:33Z

Removed EliminateOuterJoin related code changes from this PR. Will create follow-up PR for it.

adriangb

This looks good! My main concern is about the representation of Option<bool> vs. an enum.

adriangb · 2026-04-24T13:02:18Z

+    /// `null_columns` contains the column indices (in the input schema) that
+    /// are assumed to be `NULL`.
+    ///
+    /// - `Some(true)`:  definitely evaluates to `NULL` or `FALSE`


Suggested change

/// - `Some(true)`: definitely evaluates to `NULL` or `FALSE`

/// - `Some(true)`: some rows may evaluate `NULL` or `FALSE`

I assume the point is not that this expression is always NULL/FALSE for any input, rather that it may sometimes be NULL/FALSE?

If this distinction may be useful, should we make an enum?

enum IsFalsy { /// Falsy (`false` or `null`) for any input Always, /// May be `false` or `null` for some inputs but not others. /// This is the conservative default. Sometimes, /// Never falsy Never, }

These are the same 3 states but more explicitly spelled out IMO.

None return value denote Sometimes enum but having an enum make our intent more explicit. I will refactor code.

refactored.

adriangb · 2026-04-28T15:31:44Z

Starting another round of review sorry for the delay

adriangb

Thanks for working on this! This is step in the right direction, but I have some concerns about the current API shape and a few concrete bugs.

Concrete bugs

1. Doc comment for `is_not_true` has the variants swapped

/// - [`IsFalsy::Always`]: some rows may evaluate `NULL` or `FALSE`
/// - [`IsFalsy::Never`]: definitely does NOT evaluate to `NULL` or `FALSE`

Always should be "definitely evaluates to NULL or FALSE for all rows." The text "some rows may" describes Sometimes. The is_null doc is correct; this one was likely copy-paste-edited.

2. `Never` is overloaded as a default for "no information"

Many implementations return Never as a fall-through rather than as a proven negative:

Column::is_null returns Never when the column isn't in null_columns (we don't actually know it's non-null, especially if we are going to be using this for predicate pruning in parquet).
Column::is_not_true returns Never for any column outside null_columns.
Literal::is_not_true returns Never for any non-null non-boolean literal.

This conflates "we proved it's not the case" with "we have no information," collapsing the tri-state into effectively a bi-state. Callers can't trust Never to mean "proven."

3. `NotExpr::is_not_true` misses the most useful case (acknowledged in tests)

NOT(IS NULL(a)) with a null evaluates to NOT(TRUE) = FALSE, which is not-true — but the implementation returns Never. The test not_is_null_does_reject is named to suggest rejection but asserts IsFalsy::Never. This is a regression vs. what a hand-rolled walker can detect.

Bigger API concern

The semantics implicitly assume "columns NOT in null_columns are non-null." This isn't documented but is load-bearing — IsNotNullExpr::is_not_true, Column::is_null, etc. only return correct answers under this convention.

That convention works for the EliminateOuterJoin use case (where one side of the join is fully nulled at once), but for parquet pruning there is no such guarantee — a column not in any "known-null" might just be missing stats, etc.

I'd suggest either:

(a) explicitly scoping the doc and renaming methods to make the null-rejection convention clear, or
(b) switching to the conservative semantics where columns outside null_columns have unknown nullness.

I think I worked through the cases and the conservative version still gives correct (and equally optimizable) answers for the EliminateOuterJoin integration, since that caller always passes a whole side at once. So conservative is strictly safer.

Suggestion: a 4-state truth lattice

The deeper issue is that IsFalsy (a 3-state property indicator) makes some operators harder to reason about than they need to be. Several precision losses (NULL AND TRUE = NULL not detected, NOT(IS NULL) rejection not detected) stem from not being able to express "this is definitely TRUE" vs. "this is definitely FALSE."

A cleaner shape:

enum Truth {
    AlwaysTrue,
    AlwaysFalse,
    AlwaysNull,
    Unknown,
}

enum Nullness {
    AlwaysNull,
    NeverNull,
    Unknown,
}

trait PhysicalExpr {
    /// Defined for all expressions.
    fn nullness(&self, null_cols: &HashSet<usize>) -> Nullness;

    /// Defined for boolean expressions only.
    fn truthiness(&self, null_cols: &HashSet<usize>) -> Truth;
}

Boolean ops become straight truth-tables over Truth (e.g., NULL AND FALSE = FALSE is one match arm), NOT is a clean flip, CASE can prune branches when the WHEN is determined, etc. Non-boolean expressions only expose nullness — WHERE int_expr is a type error in DataFusion, so we never need to assign truthiness to non-booleans.

The two consumer use cases collapse to:

Null-rejection: truthiness == AlwaysFalse || truthiness == AlwaysNull.
Pruning: same check.

This also makes the parquet-pruning reuse straightforward under the conservative-semantics fix above.

Example operator rules (not exhaustive, AI generated):

Comparisons (=, <, …) — NULL-propagating, ask nullness of children:

If either child has nullness = AlwaysNull → AlwaysNull.
Else → Unknown (we don't track values).

AND truth table:

∧	AT	AF	AN	U
AT	AT	AF	AN	U
AF	AF	AF	AF	AF
AN	AN	AF	AN	U
U	U	AF	U	U

OR truth table:

∨	AT	AF	AN	U
AT	AT	AT	AT	AT
AF	AT	AF	AN	U
AN	AT	AN	AN	U
U	AT	U	U	U

NOT: AT ↔ AF, AN → AN, U → U.

IS NULL(x) / IS NOT NULL(x): derived from nullness(x); never returns AlwaysNull.

CASE WHEN c THEN t ELSE e END:

truthiness(c) = AlwaysTrue → result = truthiness(t)
truthiness(c) = AlwaysFalse | AlwaysNull → result = truthiness(e)
truthiness(c) = Unknown → lattice-join of truthiness(t) and truthiness(e)

Worked example: `a = 1 OR b IN (1, 2, 3)`

With null_cols = {a, b}:

a = 1:           nullness(a)=AlwaysNull → truthiness = AlwaysNull
b IN (1, 2, 3):  nullness(b)=AlwaysNull → truthiness = AlwaysNull
OR(AlwaysNull, AlwaysNull) = AlwaysNull

Top-level AlwaysNull → predicate rejects → outer→inner. ✓

With null_cols = {a} only:

a = 1:           AlwaysNull
b IN (1, 2, 3):  nullness(b)=Unknown → truthiness = Unknown
OR(AlwaysNull, Unknown) = Unknown

Don't reject. ✓ (If b is 1, predicate is TRUE.)

Subham Singhal added 2 commits April 14, 2026 20:33

Add is_null/is_not_true trait methods to PhysicalExpr

9f08bd2

lint fix

d359199

github-actions Bot added physical-expr Changes to the physical-expr crates optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Apr 14, 2026

Subham Singhal added 4 commits April 15, 2026 10:34

Fix build failure

3532cf4

Merge branch 'main' into null-rejection-trait

c8d0633

Merge branch 'main' into null-rejection-trait

4bc51f7

Fix UT

37c6a43

SubhamSinghal changed the title ~~Null rejection trait~~ Add is_null/is_not_true trait methods to PhysicalExpr Apr 19, 2026

Subham Singhal added 2 commits April 24, 2026 10:50

Remove EliminateOuterJoin changes, split to followup PR

8154351

Merge branch 'main' into null-rejection-trait

86bcb35

github-actions Bot removed optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Apr 24, 2026

Merge branch 'main' into null-rejection-trait

8128646

adriangb reviewed Apr 25, 2026

View reviewed changes

Subham Singhal and others added 5 commits April 26, 2026 19:35

Adds IsFalsy enum

f8c7e86

Merge branch 'main' into null-rejection-trait

47502c6

Fix build failure

2f95300

Fix clippy test

9f87e1b

Merge branch 'main' into null-rejection-trait

8783f31

adriangb reviewed Apr 28, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add is_null/is_not_true trait methods to PhysicalExpr #21623

Add is_null/is_not_true trait methods to PhysicalExpr #21623
SubhamSinghal wants to merge 14 commits intoapache:mainfrom
SubhamSinghal:null-rejection-trait

SubhamSinghal commented Apr 14, 2026 •

edited

Loading

Uh oh!

SubhamSinghal commented Apr 16, 2026

Uh oh!

SubhamSinghal commented Apr 23, 2026

Uh oh!

adriangb commented Apr 23, 2026

Uh oh!

SubhamSinghal commented Apr 24, 2026

Uh oh!

adriangb left a comment

Uh oh!

adriangb Apr 24, 2026

Uh oh!

SubhamSinghal Apr 26, 2026

Uh oh!

SubhamSinghal Apr 26, 2026

Uh oh!

Uh oh!

adriangb commented Apr 28, 2026

Uh oh!

adriangb left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	/// - `Some(true)`: definitely evaluates to `NULL` or `FALSE`
	/// - `Some(true)`: some rows may evaluate `NULL` or `FALSE`

Conversation

SubhamSinghal commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

SubhamSinghal commented Apr 16, 2026

Uh oh!

SubhamSinghal commented Apr 23, 2026

Uh oh!

adriangb commented Apr 23, 2026

Uh oh!

SubhamSinghal commented Apr 24, 2026

Uh oh!

adriangb left a comment

Choose a reason for hiding this comment

Uh oh!

adriangb Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

SubhamSinghal Apr 26, 2026

Choose a reason for hiding this comment

Uh oh!

SubhamSinghal Apr 26, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

adriangb commented Apr 28, 2026

Uh oh!

adriangb left a comment

Choose a reason for hiding this comment

Concrete bugs

1. Doc comment for is_not_true has the variants swapped

2. Never is overloaded as a default for "no information"

3. NotExpr::is_not_true misses the most useful case (acknowledged in tests)

Bigger API concern

Suggestion: a 4-state truth lattice

Example operator rules (not exhaustive, AI generated):

Worked example: a = 1 OR b IN (1, 2, 3)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

SubhamSinghal commented Apr 14, 2026 •

edited

Loading

1. Doc comment for `is_not_true` has the variants swapped

2. `Never` is overloaded as a default for "no information"

3. `NotExpr::is_not_true` misses the most useful case (acknowledged in tests)

Worked example: `a = 1 OR b IN (1, 2, 3)`