Skip to content

Add is_null/is_not_true trait methods to PhysicalExpr #21623

Open
SubhamSinghal wants to merge 14 commits intoapache:mainfrom
SubhamSinghal:null-rejection-trait
Open

Add is_null/is_not_true trait methods to PhysicalExpr #21623
SubhamSinghal wants to merge 14 commits intoapache:mainfrom
SubhamSinghal:null-rejection-trait

Conversation

@SubhamSinghal
Copy link
Copy Markdown
Contributor

@SubhamSinghal SubhamSinghal commented Apr 14, 2026

Which issue does this PR close?

Rationale for this change

Today, EliminateOuterJoin uses extract_non_nullable_columns() with explicit pattern matching for each Expr variant to determine if a WHERE predicate rejects NULL rows. Every new expression type (LIKE, IS TRUE, etc.) must be added manually to this function. This is brittle and doesn't scale.

This PR introduces is_null and is_not_true trait methods on PhysicalExpr, so each expression type defines its own null-rejection behavior. This also enables future use in parquet pruning at the physical layer.

What changes are included in this PR?

New trait methods on PhysicalExpr:

  • is_null(null_columns) -> Option<bool>: returns whether the expression is guaranteed to evaluate to NULL when the given columns are NULL
  • is_not_true(null_columns) -> Option<bool>: returns whether the expression is guaranteed to evaluate to NULL or FALSE when the given columns are NULL

Are these changes tested?

Yes, with UT

Are there any user-facing changes?

No

@github-actions github-actions Bot added physical-expr Changes to the physical-expr crates optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Apr 14, 2026
@SubhamSinghal
Copy link
Copy Markdown
Contributor Author

@2010YOUY01 @asolimando I am adding is_not_true function in physical expressions as suggested here. Please have a look. Thanks.

@SubhamSinghal SubhamSinghal changed the title Null rejection trait Add is_null/is_not_true trait methods to PhysicalExpr Apr 19, 2026
@SubhamSinghal
Copy link
Copy Markdown
Contributor Author

@adriangb can you help in reviewing this PR

@adriangb
Copy link
Copy Markdown
Contributor

Thanks for the PR!

Personally I would like to review:

  1. A PR that adds the method to PhysicalExpr and all of the implementations, with unit tests for each one (especially the complex ones like CaseExpr or BinaryExpr)
  2. A followup PR that uses these methods for EliminateOuterJoin

@github-actions github-actions Bot removed optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Apr 24, 2026
@SubhamSinghal
Copy link
Copy Markdown
Contributor Author

Removed EliminateOuterJoin related code changes from this PR. Will create follow-up PR for it.

Copy link
Copy Markdown
Contributor

@adriangb adriangb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good! My main concern is about the representation of Option<bool> vs. an enum.

/// `null_columns` contains the column indices (in the input schema) that
/// are assumed to be `NULL`.
///
/// - `Some(true)`: definitely evaluates to `NULL` or `FALSE`
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// - `Some(true)`: definitely evaluates to `NULL` or `FALSE`
/// - `Some(true)`: some rows may evaluate `NULL` or `FALSE`

I assume the point is not that this expression is always NULL/FALSE for any input, rather that it may sometimes be NULL/FALSE?

If this distinction may be useful, should we make an enum?

enum IsFalsy {
  /// Falsy (`false` or `null`) for any input
  Always,
  /// May be `false` or `null` for some inputs but not others.
  /// This is the conservative default.
 Sometimes,
 /// Never falsy
 Never,
}

These are the same 3 states but more explicitly spelled out IMO.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None return value denote Sometimes enum but having an enum make our intent more explicit. I will refactor code.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactored.

Comment thread datafusion/physical-expr/src/expressions/binary.rs
@adriangb
Copy link
Copy Markdown
Contributor

Starting another round of review sorry for the delay

Copy link
Copy Markdown
Contributor

@adriangb adriangb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this! This is step in the right direction, but I have some concerns about the current API shape and a few concrete bugs.

Concrete bugs

1. Doc comment for is_not_true has the variants swapped

/// - [`IsFalsy::Always`]: some rows may evaluate `NULL` or `FALSE`
/// - [`IsFalsy::Never`]: definitely does NOT evaluate to `NULL` or `FALSE`

Always should be "definitely evaluates to NULL or FALSE for all rows." The text "some rows may" describes Sometimes. The is_null doc is correct; this one was likely copy-paste-edited.

2. Never is overloaded as a default for "no information"

Many implementations return Never as a fall-through rather than as a proven negative:

  • Column::is_null returns Never when the column isn't in null_columns (we don't actually know it's non-null, especially if we are going to be using this for predicate pruning in parquet).
  • Column::is_not_true returns Never for any column outside null_columns.
  • Literal::is_not_true returns Never for any non-null non-boolean literal.

This conflates "we proved it's not the case" with "we have no information," collapsing the tri-state into effectively a bi-state. Callers can't trust Never to mean "proven."

3. NotExpr::is_not_true misses the most useful case (acknowledged in tests)

NOT(IS NULL(a)) with a null evaluates to NOT(TRUE) = FALSE, which is not-true — but the implementation returns Never. The test not_is_null_does_reject is named to suggest rejection but asserts IsFalsy::Never. This is a regression vs. what a hand-rolled walker can detect.

Bigger API concern

The semantics implicitly assume "columns NOT in null_columns are non-null." This isn't documented but is load-bearing — IsNotNullExpr::is_not_true, Column::is_null, etc. only return correct answers under this convention.

That convention works for the EliminateOuterJoin use case (where one side of the join is fully nulled at once), but for parquet pruning there is no such guarantee — a column not in any "known-null" might just be missing stats, etc.

I'd suggest either:

  • (a) explicitly scoping the doc and renaming methods to make the null-rejection convention clear, or
  • (b) switching to the conservative semantics where columns outside null_columns have unknown nullness.

I think I worked through the cases and the conservative version still gives correct (and equally optimizable) answers for the EliminateOuterJoin integration, since that caller always passes a whole side at once. So conservative is strictly safer.

Suggestion: a 4-state truth lattice

The deeper issue is that IsFalsy (a 3-state property indicator) makes some operators harder to reason about than they need to be. Several precision losses (NULL AND TRUE = NULL not detected, NOT(IS NULL) rejection not detected) stem from not being able to express "this is definitely TRUE" vs. "this is definitely FALSE."

A cleaner shape:

enum Truth {
    AlwaysTrue,
    AlwaysFalse,
    AlwaysNull,
    Unknown,
}

enum Nullness {
    AlwaysNull,
    NeverNull,
    Unknown,
}

trait PhysicalExpr {
    /// Defined for all expressions.
    fn nullness(&self, null_cols: &HashSet<usize>) -> Nullness;

    /// Defined for boolean expressions only.
    fn truthiness(&self, null_cols: &HashSet<usize>) -> Truth;
}

Boolean ops become straight truth-tables over Truth (e.g., NULL AND FALSE = FALSE is one match arm), NOT is a clean flip, CASE can prune branches when the WHEN is determined, etc. Non-boolean expressions only expose nullnessWHERE int_expr is a type error in DataFusion, so we never need to assign truthiness to non-booleans.

The two consumer use cases collapse to:

  • Null-rejection: truthiness == AlwaysFalse || truthiness == AlwaysNull.
  • Pruning: same check.

This also makes the parquet-pruning reuse straightforward under the conservative-semantics fix above.

Example operator rules (not exhaustive, AI generated):

Comparisons (=, <, …) — NULL-propagating, ask nullness of children:

  1. If either child has nullness = AlwaysNullAlwaysNull.
  2. Else → Unknown (we don't track values).

AND truth table:

AT AF AN U
AT AT AF AN U
AF AF AF AF AF
AN AN AF AN U
U U AF U U

OR truth table:

AT AF AN U
AT AT AT AT AT
AF AT AF AN U
AN AT AN AN U
U AT U U U

NOT: AT ↔ AF, AN → AN, U → U.

IS NULL(x) / IS NOT NULL(x): derived from nullness(x); never returns AlwaysNull.

CASE WHEN c THEN t ELSE e END:

  • truthiness(c) = AlwaysTrue → result = truthiness(t)
  • truthiness(c) = AlwaysFalse | AlwaysNull → result = truthiness(e)
  • truthiness(c) = Unknown → lattice-join of truthiness(t) and truthiness(e)

Worked example: a = 1 OR b IN (1, 2, 3)

With null_cols = {a, b}:

a = 1:           nullness(a)=AlwaysNull → truthiness = AlwaysNull
b IN (1, 2, 3):  nullness(b)=AlwaysNull → truthiness = AlwaysNull
OR(AlwaysNull, AlwaysNull) = AlwaysNull

Top-level AlwaysNull → predicate rejects → outer→inner. ✓

With null_cols = {a} only:

a = 1:           AlwaysNull
b IN (1, 2, 3):  nullness(b)=Unknown → truthiness = Unknown
OR(AlwaysNull, Unknown) = Unknown

Don't reject. ✓ (If b is 1, predicate is TRUE.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-expr Changes to the physical-expr crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants