Fix region migration reliability regressions #17513

Open
Pengzna wants to merge 1 commit into apache:master from Pengzna:codex/region-migration-log-investigation

Pengzna (Collaborator) commented Apr 17, 2026

Summary

  • avoid double retry storms when deleting old region peers on DataNodes that are already Unknown
  • allow IoTConsensus createLocalPeer to reuse an existing consensus directory after cluster crash recovery
  • assign IoTConsensusV2 realtime replicate indexes lazily when events are actually supplied so HybridSource cannot create holes
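
The third bullet can be illustrated with a small stand-alone sketch. It is not the IoTDB implementation: `LazyIndexSketch`, `Event`, `assignIfAbsent`, `supply`, and `supplyWithEagerAssignment` are hypothetical names, and the filter predicate stands in for whatever causes a queued event to be dropped before it is supplied. The sketch only shows why drawing an index at supply time keeps the sequence gap-free, while drawing it at enqueue time leaves a hole for every event that is dropped later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Predicate;

// Hypothetical stand-in types; none of these names come from IoTDB.
final class LazyIndexSketch {
  static final long NO_INDEX = -1L;

  static final class Event {
    final String name;
    long replicateIndex = NO_INDEX; // unassigned until the event is supplied

    Event(String name) {
      this.name = name;
    }
  }

  private final AtomicLong lastIndex = new AtomicLong(0);

  // Idempotent: an event that already carries an index keeps it.
  synchronized long assignIfAbsent(Event event) {
    if (event.replicateIndex == NO_INDEX) {
      event.replicateIndex = lastIndex.incrementAndGet();
    }
    return event.replicateIndex;
  }

  // Lazy variant: an index is drawn only for events that are actually handed out.
  List<Long> supply(List<Event> queued, Predicate<Event> keep) {
    List<Long> supplied = new ArrayList<>();
    for (Event event : queued) {
      if (keep.test(event)) {
        supplied.add(assignIfAbsent(event));
      }
    }
    return supplied;
  }

  // Contrast case: every queued event consumes an index, even if it is dropped later.
  List<Long> supplyWithEagerAssignment(List<Event> queued, Predicate<Event> keep) {
    for (Event event : queued) {
      assignIfAbsent(event);
    }
    List<Long> supplied = new ArrayList<>();
    for (Event event : queued) {
      if (keep.test(event)) {
        supplied.add(event.replicateIndex);
      }
    }
    return supplied;
  }
}
```

With three queued events of which the second is dropped, `supply` hands out indexes 1 and 2, whereas `supplyWithEagerAssignment` on a fresh instance hands out 1 and 3, leaving index 2 as a hole.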

Validation

  • mvn -Dtest=RegionMaintainHandlerConsensusPipeTest -DskipITs -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false -ntp test
  • mvn -Dtest=StabilityTest#createLocalPeerShouldAllowExistingConsensusDir -DskipITs -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false -ntp test
  • mvn -pl iotdb-core/datanode -am -Dtest=PipeRealtimeReplicateIndexAssignmentTest -DskipITs -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false -ntp test
  • mvn verify -pl integration-test -am -P with-integration-tests,DailyIT -DskipUTs -DintegrationTest.forkCount=1 -DConfigNodeMaxHeapSize=256 -DDataNodeMaxHeapSize=1024 -DDataNodeMaxDirectMemorySize=768 -Dit.test=IoTDBRegionMigrateOriginalCrashWhenDeleteLocalPeerForIoTV1IT#crashAfterDelete -Dfailsafe.failIfNoSpecifiedTests=false -Dsurefire.failIfNoSpecifiedTests=false -ntp
  • mvn verify -pl integration-test -am -P with-integration-tests,DailyIT -DskipUTs -DintegrationTest.forkCount=1 -DConfigNodeMaxHeapSize=256 -DDataNodeMaxHeapSize=1024 -DDataNodeMaxDirectMemorySize=768 -Dit.test=IoTDBIoTConsensusV2Stream3C3DBasicIT#testDeleteTimeSeriesReplicaConsistency -Dfailsafe.failIfNoSpecifiedTests=false -Dsurefire.failIfNoSpecifiedTests=false -ntp
  • mvn verify -pl integration-test -am -P with-integration-tests,DailyIT -DskipUTs -DintegrationTest.forkCount=1 -DConfigNodeMaxHeapSize=256 -DDataNodeMaxHeapSize=1024 -DDataNodeMaxDirectMemorySize=768 -Dit.test=IoTDBRegionMigrateOriginalCrashWhenDeleteLocalPeerForIoTV2BatchIT,IoTDBRegionMigrateOriginalCrashWhenDeleteLocalPeerForIoTV2StreamIT,IoTDBRegionMigrateClusterCrashIoTV1IT#clusterCrash1,IoTDBIoTConsensusV2Batch3C3DBasicIT#testDeleteTimeSeriesReplicaConsistency -Dfailsafe.failIfNoSpecifiedTests=false -Dsurefire.failIfNoSpecifiedTests=false -ntp

Copilot AI review requested due to automatic review settings April 17, 2026 17:50

Copilot AI (Contributor) left a comment

Pull request overview

This PR addresses region migration reliability regressions across ConfigNode region-maintenance RPC behavior, IoTConsensus crash-recovery peer creation, and IoTConsensusV2 realtime replication index assignment in the Pipe subsystem.

Changes:

  • Avoid “double retry storms” by using a single RPC attempt for DELETE_OLD_REGION_PEER when the target DataNode is Unknown.
  • Allow IoTConsensus#createLocalPeer to proceed when the consensus peer directory already exists (e.g., crash recovery).
  • Assign IoTConsensusV2 realtime replicateIndexForIoTV2 lazily at supply-time (and idempotently), with added unit tests.
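
The first change can be sketched as follows, assuming that "double retry" means an outer retry loop wrapping an RPC helper that itself retries, so that the attempt counts multiply. Everything here is hypothetical: `RetryPolicySketch`, `NodeStatus`, `attemptsFor`, `callWithAttempts`, and the value of `FULL_RETRY_ATTEMPTS` are illustrative and are not taken from `RegionMaintainHandler`.

```java
import java.util.function.Supplier;

// Hypothetical names throughout; this is not RegionMaintainHandler's code.
final class RetryPolicySketch {
  enum NodeStatus {
    Running,
    Unknown
  }

  static final int FULL_RETRY_ATTEMPTS = 10; // assumed value, for illustration only

  // An Unknown node is unlikely to answer, so the inner layer tries once and
  // leaves any further retrying to its caller.
  static int attemptsFor(NodeStatus status) {
    return status == NodeStatus.Unknown ? 1 : FULL_RETRY_ATTEMPTS;
  }

  // Calls rpc up to maxAttempts times and rethrows the last failure.
  static <T> T callWithAttempts(Supplier<T> rpc, int maxAttempts) {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    RuntimeException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return rpc.get();
      } catch (RuntimeException e) {
        last = e; // remember the failure and try again if attempts remain
      }
    }
    throw last;
  }
}
```

A caller would pass `attemptsFor(status)` as `maxAttempts`, so a request to an Unknown node fails fast after one call instead of exhausting the full inner retry budget on every outer retry.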

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| iotdb-core/datanode/src/test/java/org/apache/iotdb/db/pipe/source/dataregion/realtime/PipeRealtimeReplicateIndexAssignmentTest.java | Adds a unit test to validate lazy + idempotent replicate index assignment behavior. |
| iotdb-core/datanode/src/main/java/org/apache/iotdb/db/pipe/source/dataregion/realtime/assigner/PipeDataRegionAssigner.java | Removes eager replicate index assignment during event assignment. |
| iotdb-core/datanode/src/main/java/org/apache/iotdb/db/pipe/source/dataregion/realtime/PipeRealtimeDataRegionTsFileSource.java | Assigns replicate index (if needed) at supply() time. |
| iotdb-core/datanode/src/main/java/org/apache/iotdb/db/pipe/source/dataregion/realtime/PipeRealtimeDataRegionSource.java | Introduces shared lazy/idempotent replicate index assignment helpers. |
| iotdb-core/datanode/src/main/java/org/apache/iotdb/db/pipe/source/dataregion/realtime/PipeRealtimeDataRegionLogSource.java | Assigns replicate index (if needed) at supply() time. |
| iotdb-core/datanode/src/main/java/org/apache/iotdb/db/pipe/source/dataregion/realtime/PipeRealtimeDataRegionHybridSource.java | Assigns replicate index (if needed) at supply() time to avoid holes. |
| iotdb-core/consensus/src/test/java/org/apache/iotdb/consensus/iot/StabilityTest.java | Adds a test to ensure createLocalPeer tolerates an existing consensus directory. |
| iotdb-core/consensus/src/main/java/org/apache/iotdb/consensus/iot/IoTConsensus.java | Allows reuse of an existing consensus peer directory in createLocalPeer. |
| iotdb-core/confignode/src/test/java/org/apache/iotdb/confignode/procedure/env/RegionMaintainHandlerConsensusPipeTest.java | Adds tests verifying retry behavior changes based on DataNode status. |
| iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/procedure/env/RegionMaintainHandler.java | Uses node status to decide between full retry vs single-attempt RPC for deleting old peers. |

  String path = buildPeerDir(storageDir, groupId);
  File file = new File(path);
- if (!file.mkdirs()) {
+ if (!file.exists() && !file.mkdirs()) {
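
The reason this one-line change matters is that `java.io.File#mkdirs()` returns `true` only when it actually creates the directory, so it returns `false` for a directory that already exists. The sketch below isolates the two conditions in a hypothetical helper class (`PeerDirCheckSketch` and its method names are illustrative, not part of the PR) so the difference can be observed directly. Note that `exists()` is also true for a regular file; `isDirectory()` would be a stricter test, though whether that matters depends on the surrounding code.

```java
import java.io.File;

// Hypothetical helper that isolates the two conditions; not part of the PR.
final class PeerDirCheckSketch {
  // Condition before the change: true means "treat as failure".
  // mkdirs() returns false for an already existing directory, so reuse is rejected.
  static boolean oldCheckFails(File dir) {
    return !dir.mkdirs();
  }

  // Condition after the change: an existing path is accepted without calling mkdirs().
  static boolean newCheckFails(File dir) {
    return !dir.exists() && !dir.mkdirs();
  }
}
```

On a directory left behind by a previous run, `oldCheckFails` reports a failure while `newCheckFails` does not, which matches the crash-recovery scenario described in the summary.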
Comment on lines +487 to +491
protected boolean shouldAssignReplicateIndex(final Event suppliedEvent) {
  return !(suppliedEvent instanceof ProgressReportEvent)
      && DataRegionConsensusImpl.getInstance() instanceof IoTConsensusV2
      && IoTConsensusV2Processor.isShouldReplicate((EnrichedEvent) suppliedEvent);
}
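
The predicate above casts `suppliedEvent` to `EnrichedEvent` without a type check. Whether every non-`ProgressReportEvent` reaching this method is an `EnrichedEvent` cannot be determined from the snippet alone; if that is not guaranteed, the cast can throw `ClassCastException`. A defensive variant might look like the sketch below. It uses stand-in types (`SketchEvent`, `SketchEnrichedEvent`, `SketchProgressReportEvent`, `SketchOtherEvent`) so that it compiles on its own, and a boolean parameter in place of the `instanceof IoTConsensusV2` check; none of these are IoTDB classes, and the type hierarchy is an assumption.

```java
// Stand-in types so the sketch compiles on its own; they are not IoTDB classes.
interface SketchEvent {}

class SketchEnrichedEvent implements SketchEvent {
  final boolean shouldReplicate;

  SketchEnrichedEvent(boolean shouldReplicate) {
    this.shouldReplicate = shouldReplicate;
  }
}

// Assumed hierarchy: a progress report is modelled as an enriched event.
final class SketchProgressReportEvent extends SketchEnrichedEvent {
  SketchProgressReportEvent() {
    super(false);
  }
}

final class SketchOtherEvent implements SketchEvent {}

final class AssignGuardSketch {
  // consensusIsV2 stands in for the "instanceof IoTConsensusV2" check in the snippet.
  static boolean shouldAssign(SketchEvent event, boolean consensusIsV2) {
    if (event instanceof SketchProgressReportEvent || !consensusIsV2) {
      return false;
    }
    if (!(event instanceof SketchEnrichedEvent)) {
      return false; // also covers null, and avoids a ClassCastException
    }
    return ((SketchEnrichedEvent) event).shouldReplicate;
  }
}
```

If the surrounding code already guarantees the event type, the extra check is redundant and the original form is fine.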

2 participants