Prevent permanent worker deadlock when cutover times out waiting for binlog sentinel #1637
Conversation
…binlog sentinel

Buffer `allEventsUpToLockProcessed` to `MaxRetries()` so the applier's send always completes immediately even after `waitForEventsUpToLock` has timed out and exited.
I think this is the largest ratio of PR description to lines of code I have ever come across
Thanks for the bug fix @VarunChandola. Can you …
Pull request overview
Fixes a gh-ost cutover failure mode where a timed-out `waitForEventsUpToLock` can leave a stale sentinel that later blocks the single-threaded applier goroutine, permanently stalling the migration.
Changes:
- Make `Migrator.allEventsUpToLockProcessed` a buffered channel sized to `MaxRetries()` to avoid applier deadlock after cutover timeouts.
- Add inline documentation explaining why the channel is buffered.
| File | Description |
|---|---|
| go/logic/migrator.go | Buffers the lock-sentinel delivery channel to prevent applier deadlock when cutover retries time out. |
Copilot's findings
- Files reviewed: 1/1 changed files
- Comments generated: 2
```go
// Buffered to MaxRetries() to prevent a deadlock when waitForEventsUpToLock times out.
// (see https://github.com/github/gh-ost/pull/1637)
allEventsUpToLockProcessed: make(chan *lockProcessedStruct, context.MaxRetries()),
```
`make(chan *lockProcessedStruct, context.MaxRetries())` passes an `int64` buffer size, since `MigrationContext.MaxRetries()` returns `int64` (go/base/context.go:495); a value outside the range of `int` fails at run time rather than at compile time. Convert to `int` (and consider guarding against overflow if you want to be extra defensive).
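A defensive conversion along those lines might look like the sketch below; `intBufferSize` and the `math.MaxInt32` cap are illustrative choices, not part of the PR:

```go
package main

import (
	"fmt"
	"math"
)

type lockProcessedStruct struct{}

// intBufferSize converts an int64 retry count to a non-negative int,
// clamping values that would overflow int on 32-bit platforms.
func intBufferSize(retries int64) int {
	switch {
	case retries < 0:
		return 0
	case retries > math.MaxInt32:
		return math.MaxInt32 // arbitrary defensive cap
	default:
		return int(retries)
	}
}

func main() {
	maxRetries := int64(60) // stand-in for context.MaxRetries()
	ch := make(chan *lockProcessedStruct, intBufferSize(maxRetries))
	fmt.Println(cap(ch)) // 60
}
```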
This change is meant to prevent a specific deadlock scenario, but there's no regression test exercising the "timeout, then applier executes stale sentinel" behavior. Consider adding a unit test that constructs a `Migrator`, performs a non-blocking send on `allEventsUpToLockProcessed` without any receiver, or asserts the channel has the expected buffering based on `MaxRetries()`, so future refactors don't reintroduce the unbuffered-channel deadlock.
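A test in that spirit might look like this sketch; the `NewMigrator` signature and in-package access to the unexported channel field are assumptions about the test setup, not verified repository details:

```go
package logic

import (
	"testing"

	"github.com/github/gh-ost/go/base"
)

// Asserts the sentinel channel absorbs a send with no receiver, so a
// timed-out waitForEventsUpToLock can no longer deadlock the applier.
func TestAllEventsUpToLockProcessedIsBuffered(t *testing.T) {
	migrationContext := base.NewMigrationContext()
	migrator := NewMigrator(migrationContext, "test") // signature assumed
	select {
	case migrator.allEventsUpToLockProcessed <- &lockProcessedStruct{}:
		// send completed with no receiver: the channel is buffered
	default:
		t.Fatal("send would block: allEventsUpToLockProcessed is unbuffered again")
	}
}
```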
…1666)

PR #1637 buffered `allEventsUpToLockProcessed` to `MaxRetries()` to prevent a goroutine deadlock when `waitForEventsUpToLock` times out during cutover. However, when `--default-retries` is set to a very large value (e.g. 9999999999999), Go tries to allocate a channel with trillions of buffer slots, causing an immediate OOM crash before the migration even starts.

Replace the `MaxRetries()`-sized buffer with a buffer of 1 and overwrite-oldest (latest-wins) send semantics. When the buffer is full (the receiver timed out on a previous attempt), the stale message is drained before sending the current sentinel. This guarantees:

- The current sentinel is always delivered (no message loss)
- The `executeWriteFuncs` worker is never blocked (no deadlock)
- Memory usage is constant regardless of `MaxRetries()` (no OOM)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
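The drain-then-send (latest-wins) pattern that commit describes can be sketched with a capacity-1 channel; the single-sender assumption mirrors the single-threaded applier, and the `heartbeat` field is illustrative:

```go
package main

import "fmt"

type lockProcessedStruct struct{ heartbeat string } // field is illustrative

// sendLatest delivers msg with latest-wins semantics on a capacity-1 channel.
// With a single sender (the applier), neither the drain nor the send can block.
func sendLatest(ch chan *lockProcessedStruct, msg *lockProcessedStruct) {
	select {
	case <-ch: // drop the stale sentinel left over from a timed-out attempt
	default: // buffer was already empty
	}
	ch <- msg // one free slot is guaranteed, so this completes immediately
}

func main() {
	ch := make(chan *lockProcessedStruct, 1)
	sendLatest(ch, &lockProcessedStruct{"attempt-1"}) // receiver timed out; message lingers
	sendLatest(ch, &lockProcessedStruct{"attempt-2"}) // stale drained, current delivered
	fmt.Println((<-ch).heartbeat) // attempt-2
}
```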
Fixes #971.
We recently hit a case where gh-ost stalled permanently mid-cutover: heartbeat lag grew
without bound and no further cutover attempts were made despite the process still running.
Background
As part of atomic cutover, gh-ost locks the original table, waits for all in-flight DML to be applied to the ghost table, then executes the rename. The "wait for in-flight DML" step works via a sentinel value written to the changelog table. This sentinel travels via the binlog stream back to gh-ost, which applies binlog events sequentially through a single-threaded `applyEventsQueue`. The sentinel is enqueued onto that same queue, so by the time the applier executes it, all preceding DML is guaranteed to have been applied: `waitForEventsUpToLock` then blocks waiting to receive on `allEventsUpToLockProcessed`, which only fires once the applier has worked through the queue and executed the sentinel.
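As a minimal model of that mechanism (illustrative names, not gh-ost's actual code), the sentinel rides the same serial queue as the DML apply funcs, so executing it proves everything queued before it has already been applied:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	applyEventsQueue := make(chan func(), 16)
	allEventsUpToLockProcessed := make(chan struct{}) // unbuffered, as before this PR

	// Single-threaded applier: executes queued funcs strictly in order.
	go func() {
		for task := range applyEventsQueue {
			task()
		}
	}()

	for i := 1; i <= 3; i++ {
		i := i
		applyEventsQueue <- func() { fmt.Println("applied DML event", i) }
	}
	// The sentinel is queued behind the DML, so executing it proves the
	// DML above has been applied.
	applyEventsQueue <- func() { allEventsUpToLockProcessed <- struct{}{} }

	select {
	case <-allEventsUpToLockProcessed:
		fmt.Println("all events up to lock processed")
	case <-time.After(time.Second): // stands in for CutOverLockTimeoutSeconds
		fmt.Println("timed out waiting for sentinel")
	}
}
```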
The bug
Under high DML load the queue backlog is large enough that the sentinel isn't executed before `CutOverLockTimeoutSeconds` expires. This is subtle: the logs show the sentinel being intercepted quickly, which is easy to misread as the sentinel having been delivered. `Handled changelog state AllEventsUpToLockProcessed` only means `onChangelogStateEvent` spawned the goroutine to push onto `applyEventsQueue`; it does not mean the applier has executed it, or that `allEventsUpToLockProcessed` has been written to.

From our logs: `waitForEventsUpToLock` times out and exits, but the sentinel func is still in the queue. When the applier eventually reaches it and executes it, it tries to send to `allEventsUpToLockProcessed`, an unbuffered channel. The only goroutine that reads from this channel is `waitForEventsUpToLock`, which has already exited. The applier blocks permanently on that send.

On subsequent cutover attempts, we never reach `waitForEventsUpToLock` at all. The `sleepWhileTrue` heartbeat lag guard at the top of `cutOver()` sees high heartbeat lag (caused by the blocked applier, which has stopped processing all binlog events, including heartbeats) and spins forever waiting for it to recover. It never does. The migration is permanently stalled with no error, no restart, and no further progress. This matches the symptoms in #971.
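The failure reduces to a small, self-contained repro (a sketch, not gh-ost code): a waiter that times out and exits, then a sender blocking forever on the unbuffered channel:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	allEventsUpToLockProcessed := make(chan struct{}) // unbuffered, pre-fix

	// Waiter (waitForEventsUpToLock): gives up before the sentinel executes.
	waiterDone := make(chan struct{})
	go func() {
		defer close(waiterDone)
		select {
		case <-allEventsUpToLockProcessed:
			fmt.Println("sentinel received in time")
		case <-time.After(50 * time.Millisecond): // CutOverLockTimeoutSeconds stand-in
			fmt.Println("timed out; waiter exits")
		}
	}()
	<-waiterDone

	// Applier finally executes the stale sentinel; nobody will ever receive.
	go func() {
		allEventsUpToLockProcessed <- struct{}{} // blocks forever
		fmt.Println("never printed")
	}()

	time.Sleep(100 * time.Millisecond)
	fmt.Println("applier still blocked; no further binlog events (or heartbeats) get applied")
}
```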
Fix
Buffer `allEventsUpToLockProcessed` to `MaxRetries()`. There are arguably less wasteful approaches, but this is the simplest: the applier's send always completes immediately regardless of whether `waitForEventsUpToLock` is still listening. Stale sentinels from timed-out attempts accumulate in the buffer and are discarded by the existing `state != challenge` skip loop on subsequent calls to `waitForEventsUpToLock`.
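The receive side of that design can be sketched as a drain loop; the `state` field and the challenge strings below are illustrative stand-ins for gh-ost's actual sentinel payload:

```go
package main

import "fmt"

type lockProcessedStruct struct{ state string } // field name is illustrative

// waitForSentinel discards buffered sentinels from earlier timed-out attempts
// until it receives the one carrying the current attempt's challenge.
func waitForSentinel(ch chan *lockProcessedStruct, challenge string) {
	for {
		processed := <-ch
		if processed.state == challenge {
			return // this attempt's sentinel
		}
		fmt.Println("skipping stale sentinel:", processed.state)
	}
}

func main() {
	ch := make(chan *lockProcessedStruct, 4) // buffered, as in this PR
	ch <- &lockProcessedStruct{"attempt-1"}  // stale, from a timed-out cutover
	ch <- &lockProcessedStruct{"attempt-2"}  // current attempt
	waitForSentinel(ch, "attempt-2")
	fmt.Println("all events up to lock processed")
}
```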