Prevent permanent worker deadlock when cutover times out waiting for binlog sentinel #1637
Conversation
…binlog sentinel

Buffer `allEventsUpToLockProcessed` to `MaxRetries()` so the applier's send always completes immediately even after `waitForEventsUpToLock` has timed out and exited.
I think this is the largest ratio of PR description to lines of code I have ever come across
Thanks for the bug fix @VarunChandola. Can you …
Pull request overview
Fixes a gh-ost cutover failure mode where a timed-out `waitForEventsUpToLock` can leave a stale sentinel that later blocks the single-threaded applier goroutine, permanently stalling the migration.
Changes:
- Make `Migrator.allEventsUpToLockProcessed` a buffered channel sized to `MaxRetries()` to avoid applier deadlock after cutover timeouts.
- Add inline documentation explaining why the channel is buffered.
| File | Description |
|---|---|
| go/logic/migrator.go | Buffers the lock-sentinel delivery channel to prevent applier deadlock when cutover retries time out. |
Copilot's findings
- Files reviewed: 1/1 changed files
- Comments generated: 2
```go
// Buffered to MaxRetries() to prevent a deadlock when waitForEventsUpToLock times out.
// (see https://github.com/github/gh-ost/pull/1637)
allEventsUpToLockProcessed: make(chan *lockProcessedStruct, context.MaxRetries()),
```
`make(chan *lockProcessedStruct, context.MaxRetries())` passes an `int64` buffer size, since `MigrationContext.MaxRetries()` returns `int64` (go/base/context.go:495); a value outside the range of `int` fails at run time rather than at compile time. Convert to `int` (and consider guarding against overflow if you want to be extra defensive).
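A defensive conversion along those lines might look like the sketch below; `intBufferSize` and the `math.MaxInt32` cap are illustrative choices, not part of the PR:

```go
package main

import (
	"fmt"
	"math"
)

type lockProcessedStruct struct{}

// intBufferSize converts an int64 retry count to a non-negative int,
// clamping values that would overflow int on 32-bit platforms.
func intBufferSize(retries int64) int {
	switch {
	case retries < 0:
		return 0
	case retries > math.MaxInt32:
		return math.MaxInt32 // arbitrary defensive cap
	default:
		return int(retries)
	}
}

func main() {
	maxRetries := int64(60) // stand-in for context.MaxRetries()
	ch := make(chan *lockProcessedStruct, intBufferSize(maxRetries))
	fmt.Println(cap(ch)) // 60
}
```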
This change is meant to prevent a specific deadlock scenario, but there's no regression test exercising the "timeout, then applier executes stale sentinel" behavior. Consider adding a unit test that constructs a `Migrator`, performs a non-blocking send on `allEventsUpToLockProcessed` without any receiver, or asserts the channel has the expected buffering based on `MaxRetries()`, so future refactors don't reintroduce the unbuffered-channel deadlock.
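A test in that spirit might look like this sketch; the `NewMigrator` signature and in-package access to the unexported channel field are assumptions about the test setup, not verified repository details:

```go
package logic

import (
	"testing"

	"github.com/github/gh-ost/go/base"
)

// Asserts the sentinel channel absorbs a send with no receiver, so a
// timed-out waitForEventsUpToLock can no longer deadlock the applier.
func TestAllEventsUpToLockProcessedIsBuffered(t *testing.T) {
	migrationContext := base.NewMigrationContext()
	migrator := NewMigrator(migrationContext, "test") // signature assumed
	select {
	case migrator.allEventsUpToLockProcessed <- &lockProcessedStruct{}:
		// send completed with no receiver: the channel is buffered
	default:
		t.Fatal("send would block: allEventsUpToLockProcessed is unbuffered again")
	}
}
```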
…1666)

PR #1637 buffered `allEventsUpToLockProcessed` to `MaxRetries()` to prevent a goroutine deadlock when `waitForEventsUpToLock` times out during cutover. However, when `--default-retries` is set to a very large value (e.g. 9999999999999), Go tries to allocate a channel with trillions of buffer slots, causing an immediate OOM crash before the migration even starts.

Replace the `MaxRetries()`-sized buffer with a buffer of 1 and overwrite-oldest (latest-wins) send semantics. When the buffer is full (the receiver timed out on a previous attempt), the stale message is drained before sending the current sentinel. This guarantees:

- The current sentinel is always delivered (no message loss)
- The `executeWriteFuncs` worker is never blocked (no deadlock)
- Memory usage is constant regardless of `MaxRetries()` (no OOM)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
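The drain-then-send (latest-wins) pattern that commit describes can be sketched with a capacity-1 channel; the single-sender assumption mirrors the single-threaded applier, and the `heartbeat` field is illustrative:

```go
package main

import "fmt"

type lockProcessedStruct struct{ heartbeat string } // field is illustrative

// sendLatest delivers msg with latest-wins semantics on a capacity-1 channel.
// With a single sender (the applier), neither the drain nor the send can block.
func sendLatest(ch chan *lockProcessedStruct, msg *lockProcessedStruct) {
	select {
	case <-ch: // drop the stale sentinel left over from a timed-out attempt
	default: // buffer was already empty
	}
	ch <- msg // one free slot is guaranteed, so this completes immediately
}

func main() {
	ch := make(chan *lockProcessedStruct, 1)
	sendLatest(ch, &lockProcessedStruct{"attempt-1"}) // receiver timed out; message lingers
	sendLatest(ch, &lockProcessedStruct{"attempt-2"}) // stale drained, current delivered
	fmt.Println((<-ch).heartbeat) // attempt-2
}
```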
Fixes #971.
We recently hit a case where gh-ost stalled permanently mid-cutover: heartbeat lag grew
without bound and no further cutover attempts were made despite the process still running.
Background
As part of atomic cutover, gh-ost locks the original table, waits for all in-flight DML to be applied to the ghost table, then executes the rename. The "wait for in-flight DML" step works via a sentinel value written to the changelog table. This sentinel travels via the binlog stream back to gh-ost, which applies binlog events sequentially through a single-threaded `applyEventsQueue`. The sentinel is enqueued onto that same queue, so by the time the applier executes it, all preceding DML is guaranteed to have been applied: `waitForEventsUpToLock` then blocks waiting to receive on `allEventsUpToLockProcessed`, which only fires once the applier has worked through the queue and executed the sentinel.
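As a minimal model of that mechanism (illustrative names, not gh-ost's actual code), the sentinel rides the same serial queue as the DML apply funcs, so executing it proves everything queued before it has already been applied:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	applyEventsQueue := make(chan func(), 16)
	allEventsUpToLockProcessed := make(chan struct{}) // unbuffered, as before this PR

	// Single-threaded applier: executes queued funcs strictly in order.
	go func() {
		for task := range applyEventsQueue {
			task()
		}
	}()

	for i := 1; i <= 3; i++ {
		i := i
		applyEventsQueue <- func() { fmt.Println("applied DML event", i) }
	}
	// The sentinel is queued behind the DML, so executing it proves the
	// DML above has been applied.
	applyEventsQueue <- func() { allEventsUpToLockProcessed <- struct{}{} }

	select {
	case <-allEventsUpToLockProcessed:
		fmt.Println("all events up to lock processed")
	case <-time.After(time.Second): // stands in for CutOverLockTimeoutSeconds
		fmt.Println("timed out waiting for sentinel")
	}
}
```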
The bug
Under high DML load the queue backlog is large enough that the sentinel isn't executed before `CutOverLockTimeoutSeconds` expires. This is subtle: the logs show the sentinel being intercepted quickly, which is easy to misread as the sentinel having been delivered. `Handled changelog state AllEventsUpToLockProcessed` only means `onChangelogStateEvent` spawned the goroutine to push onto `applyEventsQueue`; it does not mean the applier has executed it, or that `allEventsUpToLockProcessed` has been written to.

From our logs: `waitForEventsUpToLock` times out and exits, but the sentinel func is still in the queue. When the applier eventually reaches it and executes it, it tries to send to `allEventsUpToLockProcessed`, an unbuffered channel. The only goroutine that reads from this channel is `waitForEventsUpToLock`, which has already exited. The applier blocks permanently on that send.

On subsequent cutover attempts, we never reach `waitForEventsUpToLock` at all. The `sleepWhileTrue` heartbeat lag guard at the top of `cutOver()` sees high heartbeat lag (caused by the blocked applier, which has stopped processing all binlog events, including heartbeats) and spins forever waiting for it to recover. It never does. The migration is permanently stalled with no error, no restart, and no further progress. This matches the symptoms in #971.
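The failure reduces to a small, self-contained repro (a sketch, not gh-ost code): a waiter that times out and exits, then a sender blocking forever on the unbuffered channel:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	allEventsUpToLockProcessed := make(chan struct{}) // unbuffered, pre-fix

	// Waiter (waitForEventsUpToLock): gives up before the sentinel executes.
	waiterDone := make(chan struct{})
	go func() {
		defer close(waiterDone)
		select {
		case <-allEventsUpToLockProcessed:
			fmt.Println("sentinel received in time")
		case <-time.After(50 * time.Millisecond): // CutOverLockTimeoutSeconds stand-in
			fmt.Println("timed out; waiter exits")
		}
	}()
	<-waiterDone

	// Applier finally executes the stale sentinel; nobody will ever receive.
	go func() {
		allEventsUpToLockProcessed <- struct{}{} // blocks forever
		fmt.Println("never printed")
	}()

	time.Sleep(100 * time.Millisecond)
	fmt.Println("applier still blocked; no further binlog events (or heartbeats) get applied")
}
```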
Fix
Buffer `allEventsUpToLockProcessed` to `MaxRetries()`. There are arguably less wasteful approaches, but this is the simplest: the applier's send always completes immediately regardless of whether `waitForEventsUpToLock` is still listening. Stale sentinels from timed-out attempts accumulate in the buffer and are discarded by the existing `state != challenge` skip loop on subsequent calls to `waitForEventsUpToLock`.
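The receive side of that design can be sketched as a drain loop; the `state` field and the challenge strings below are illustrative stand-ins for gh-ost's actual sentinel payload:

```go
package main

import "fmt"

type lockProcessedStruct struct{ state string } // field name is illustrative

// waitForSentinel discards buffered sentinels from earlier timed-out attempts
// until it receives the one carrying the current attempt's challenge.
func waitForSentinel(ch chan *lockProcessedStruct, challenge string) {
	for {
		processed := <-ch
		if processed.state == challenge {
			return // this attempt's sentinel
		}
		fmt.Println("skipping stale sentinel:", processed.state)
	}
}

func main() {
	ch := make(chan *lockProcessedStruct, 4) // buffered, as in this PR
	ch <- &lockProcessedStruct{"attempt-1"}  // stale, from a timed-out cutover
	ch <- &lockProcessedStruct{"attempt-2"}  // current attempt
	waitForSentinel(ch, "attempt-2")
	fmt.Println("all events up to lock processed")
}
```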