[spark] Refactor BatchWrite subclasses into base logic + per-version wrappers #7723

Open
kerwin-zk wants to merge 1 commit into apache:master from kerwin-zk:spark-batchwrite-refactor

@kerwin-zk kerwin-zk commented Apr 28, 2026

Purpose

Follow-up to #7648 (Spark 4.1 module) and a sibling of #7721. After the reverse-shim layout landed, two files under paimon-spark-4.0/src/main existed only as shadows, because each of their compilation units defines a Scala class that extends BatchWrite.

Spark 4.1 added a default method BatchWrite.commit(WriterCommitMessage[], WriteSummary), and its WriteSummary parameter type does not exist in Spark 4.0. A class compiled against 4.1 that mixes in BatchWrite carries the inherited commit(.., WriteSummary) signature in its method table. During Spark task serialization, the JVM's ObjectStreamClass.getPrivateMethod lazily links that signature, and on Spark 4.0 this crashes with ClassNotFoundException: WriteSummary.

This PR refactors both affected classes into the same base + per-version wrapper pattern:

  • PaimonBatchWrite (used by V2 writes)
  • FormatTableBatchWrite (used by FormatTable V2 writes — was previously a private case class inside PaimonFormatTable.scala)

For each class, the body now lives in a new abstract base in paimon-spark-common that deliberately does not extend BatchWrite. The base exposes the logic through renamed protected helpers: commitMessages, abortMessages, createPaimonDataWriterFactory, and createFormatTableDataWriterFactory. Each per-version module (paimon-spark3-common, paimon-spark4-common, paimon-spark-4.0/src/main) ships a thin wrapper that mixes in BatchWrite and forwards the four BatchWrite methods to the base helpers. Instances are created through two new SparkShim factories, so each Spark version's scalac compiles its own extends BatchWrite mixin.
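
A minimal sketch of the base + wrapper split, for illustration only. It is not the code in this PR: the traits below are simplified stand-ins for Spark's DataSource V2 types so the snippet compiles without Spark on the classpath, the constructor parameter and the `history` accessor are hypothetical, and only three forwarding methods are shown.

```scala
// Stand-ins for Spark's connector types (assumption: heavily simplified).
trait WriterCommitMessage extends Serializable
trait DataWriterFactory extends Serializable
trait PhysicalWriteInfo

// Stand-in for org.apache.spark.sql.connector.write.BatchWrite.
trait BatchWrite {
  def createBatchWriterFactory(info: PhysicalWriteInfo): DataWriterFactory
  def commit(messages: Array[WriterCommitMessage]): Unit
  def abort(messages: Array[WriterCommitMessage]): Unit
}

// Version-neutral logic. It deliberately does NOT extend BatchWrite, so its
// compiled class never inherits version-specific BatchWrite methods.
abstract class PaimonBatchWriteBase(tableName: String) extends Serializable {
  // Hypothetical bookkeeping so the usage example can observe the calls.
  private val events = scala.collection.mutable.ArrayBuffer.empty[String]
  def history: List[String] = events.toList

  protected def createPaimonDataWriterFactory(info: PhysicalWriteInfo): DataWriterFactory = {
    events += s"factory:$tableName"
    new DataWriterFactory {}
  }

  protected def commitMessages(messages: Array[WriterCommitMessage]): Unit =
    events += s"commit:${messages.length}"

  protected def abortMessages(messages: Array[WriterCommitMessage]): Unit =
    events += s"abort:${messages.length}"
}

// Thin per-version wrapper: the only place that mixes in BatchWrite.
// Each Spark module would compile its own copy against its own Spark version.
class PaimonBatchWrite(name: String) extends PaimonBatchWriteBase(name) with BatchWrite {
  override def createBatchWriterFactory(info: PhysicalWriteInfo): DataWriterFactory =
    createPaimonDataWriterFactory(info)
  override def commit(messages: Array[WriterCommitMessage]): Unit = commitMessages(messages)
  override def abort(messages: Array[WriterCommitMessage]): Unit = abortMessages(messages)
}

object BatchWriteSketch {
  def main(args: Array[String]): Unit = {
    val write = new PaimonBatchWrite("t")
    val asSpark: BatchWrite = write // Spark only ever sees the BatchWrite view
    asSpark.createBatchWriterFactory(new PhysicalWriteInfo {})
    asSpark.commit(Array(new WriterCommitMessage {}))
    println(write.history) // List(factory:t, commit:1)
  }
}
```

Because the helpers have names distinct from the BatchWrite methods, the base can be compiled once in the shared module while each wrapper stays a few lines long.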

The Spark 4.0 shadow of PaimonFormatTable.scala is no longer needed and is deleted; only the new thin FormatTableBatchWrite.scala wrapper remains under paimon-spark-4.0/src/main.

Tests

CI

API and Format

No new public API. Two internal factories are added to org.apache.spark.sql.paimon.shims.SparkShim:

  • createPaimonBatchWrite(table, writeSchema, dataSchema, overwritePartitions, copyOnWriteScan)
  • createFormatTableBatchWrite(table, overwriteDynamic, overwritePartitions, writeSchema)
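
A rough sketch of how such a factory could route to per-version wrappers. The factory name and parameter names follow the list above, but everything else is an assumption for illustration: the parameter types, the `Spark3Shim` / `Spark4Shim` objects, the `shimFor` selector, and the stand-in `BatchWrite` trait are all hypothetical and do not reflect how Paimon actually loads its shim.

```scala
// Simplified stand-ins for Spark's connector types (assumption).
trait WriterCommitMessage extends Serializable
trait BatchWrite {
  def commit(messages: Array[WriterCommitMessage]): Unit
}

// Shared logic; deliberately not a BatchWrite.
abstract class FormatTableBatchWriteBase(table: String, overwriteDynamic: Boolean)
    extends Serializable {
  // Hypothetical field so callers of this sketch can inspect the result.
  var lastCommit: Option[String] = None

  protected def commitMessages(messages: Array[WriterCommitMessage]): Unit =
    lastCommit = Some(s"$table:${messages.length}:dynamic=$overwriteDynamic")
}

// Stand-in for the shim trait. Parameter types are guesses.
trait SparkShim {
  def createFormatTableBatchWrite(
      table: String,
      overwriteDynamic: Boolean,
      overwritePartitions: Option[Map[String, String]],
      writeSchema: Seq[String]): BatchWrite
}

// In the real layout each shim lives in its own module and is compiled
// against its own Spark version; here both sit in one file.
object Spark3Shim extends SparkShim {
  private class Wrapper(t: String, dyn: Boolean)
      extends FormatTableBatchWriteBase(t, dyn) with BatchWrite {
    override def commit(messages: Array[WriterCommitMessage]): Unit = commitMessages(messages)
  }
  // overwritePartitions and writeSchema are ignored in this sketch.
  override def createFormatTableBatchWrite(
      table: String,
      overwriteDynamic: Boolean,
      overwritePartitions: Option[Map[String, String]],
      writeSchema: Seq[String]): BatchWrite = new Wrapper(table, overwriteDynamic)
}

object Spark4Shim extends SparkShim {
  private class Wrapper(t: String, dyn: Boolean)
      extends FormatTableBatchWriteBase(t, dyn) with BatchWrite {
    override def commit(messages: Array[WriterCommitMessage]): Unit = commitMessages(messages)
  }
  override def createFormatTableBatchWrite(
      table: String,
      overwriteDynamic: Boolean,
      overwritePartitions: Option[Map[String, String]],
      writeSchema: Seq[String]): BatchWrite = new Wrapper(table, overwriteDynamic)
}

object ShimSketch {
  // Hypothetical selector keyed on the Spark version string.
  def shimFor(sparkVersion: String): SparkShim =
    if (sparkVersion.startsWith("3.")) Spark3Shim else Spark4Shim
}
```

Shared code only calls the factory and receives a `BatchWrite`, so it never names a class that mixes in `BatchWrite` itself.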

Documentation

No user-facing changes.

@kerwin-zk kerwin-zk force-pushed the spark-batchwrite-refactor branch from b4569e5 to d274193 Compare April 28, 2026 10:42