bridge: add guest-side reconnect loop for live migration #2698

shreyanshjain7174 merged 2 commits into microsoft:main
Conversation
05c7170 to 5fafdf4
5fafdf4 to c45e67a
rawahars left a comment
Overall looking good.
Consider the suggestion below and see if it reduces the complexity and indirection.
We do not need a publisher indirection at all. The address of `PublishNotification` on the bridge will remain valid at all times simply because the bridge instance is never re-created; only connections are reset.
We can streamline this by adding a buffered channel (`pendingMessages`) directly to the bridge. `PublishNotification` simply pushes messages into this buffer and returns, naturally blocking to provide backpressure if the buffer hits capacity. We then run a dedicated background goroutine with a select loop to process the flow: it reads from the pending buffer and forwards to `responseChan`. By leveraging the nil-channel pattern, we can safely disable the send case when `responseChan` is nil, while also listening on `quitChan` for graceful shutdown.
Something along the lines of:
```go
type Bridge struct {
	// ... existing fields ...
	responseChan chan bridgeResponse
	quitChan     chan bool
	// Buffered channel providing backpressure for publishers.
	pendingMessages chan bridgeResponse
}

func (b *Bridge) PublishNotification(n *prot.ContainerNotification) {
	// ... (ctx/span setup) ...
	resp := bridgeResponse{ /* ... */ }
	// Publish pushes to the buffer and returns immediately
	// (blocking only if the buffer is full).
	b.pendingMessages <- resp
}

// processNotifications runs in the background to route messages.
func (b *Bridge) processNotifications() {
	for {
		// If responseChan is nil, readChan stays nil. Go completely
		// ignores select cases on nil channels, so we never pop items
		// from pendingMessages while disconnected; they stay queued.
		var readChan chan bridgeResponse
		if b.responseChan != nil {
			readChan = b.pendingMessages
		}
		select {
		case resp := <-readChan:
			// Only reached when responseChan is non-nil.
			b.responseChan <- resp
		case <-b.quitChan:
			return
		}
	}
}
```
```go
// re-dial. During live migration the VM is frozen and only wakes up when
// the destination host shim is ready, so the vsock port should be
// immediately available.
for {
```
@jterry75 I wonder if we should retry indefinitely or enforce a limit, say 12000 iterations (20 minutes at the 100ms interval). In that case, if the shim died but the VM is still running, the VM can self-terminate after 20 minutes of inactivity.
While that is fragile, I feel that 20 minutes of inactivity after blackout is a critical failure path anyway.
What do you think?
Good question for Justin. My instinct is to retry forever — if the shim died but the VM is still running, the GCS should keep trying until either the host comes back or the VM is torn down externally. A 20-minute timeout would leave the VM in a zombie state where it's running but can't be managed. That said, happy to add a configurable limit if Justin thinks there's a scenario where self-termination is better.
No, it's not a zombie VM. IIRC if the GCS crashes, the VM will also exit, since its init process exited.
The host controls the lifetime of these VMs. Even for a zombie VM, the host would be responsible for cleaning up, so we should just loop forever. If there is no connection and nobody comes to terminate this VM, that's a separate bug to deal with; it shouldn't ever happen, since we use terminate-on-last-handle-closed.
```go
	}
	b.responseChan <- resp
	// Check quitChan first to avoid sending to a dead bridge.
	select {
```
This can silently drop in-flight messages on the old bridge when the transition to the new connection happens.
This section also makes me realize that we do not need an elaborate publisher indirection at all. The address of `PublishNotification` will remain valid at all times simply because the bridge instance is never re-created; only connections are reset.
You can just add a pending buffer to the bridge directly. `PublishNotification` then adds its message to the pending buffer and returns, blocking if the buffer is full. Another goroutine runs a select loop like this one: it leaves messages queued in the pending buffer while `responseChan` is nil, and otherwise forwards them on `responseChan`. You can also have a `quitChan` case.
This might simplify your design significantly. Do add bridge tests for this flow, though.
During live migration the vsock connection between the host and the GCS breaks when the VM moves to the destination node. The GCS bridge drops and cannot recover, leaving the guest unable to communicate with the new host.

This adds a reconnect loop in `cmd/gcs/main.go` that re-dials the bridge after a connection loss. On each iteration a fresh `Bridge` and `Mux` are created, while the `Host` state (containers, processes) persists across reconnections. A `Publisher` abstraction is added to `bridge/publisher.go` so that container wait goroutines spawned during `CreateContainer` can route exit notifications through the current bridge. When the bridge is down between reconnect iterations, notifications are dropped with a warning; the host-side shim re-queries container state after reconnecting.

The defer ordering in `ListenAndServe` is fixed so that `quitChan` closes before `responseChan` becomes invalid, and `responseChan` is buffered to prevent `PublishNotification` from panicking on a dead bridge.

Tested with `Invoke-FullLmTestCycle` on a two-node Hyper-V live migration setup (Node_1 -> Node_2). Migration completes at 100% and container exec works on the destination node after migration.

Signed-off-by: Shreyansh Sancheti <shsancheti@microsoft.com>
c45e67a to c4f32c1
- Inline publisher fields (`notifyMu`, `connected`, `pendingNotifications`) and methods (`publishNotification`, `drainPendingNotifications`, `disconnectNotifications`) directly into the `Bridge` struct, removing the separate publisher.go file.
- Simplify `PublishNotification` to a direct channel send now that the publish/disconnect pattern ensures safe access.
- Add `LinuxGcsVsockPort` constant to `internal/guest/prot` for use by the GCS reconnect loop (cannot import from `internal/gcs/prot` due to windows build tag).
- Add 4 unit tests covering notification queuing, drain-on-reconnect, disconnect-after-drain, and full reconnect cycle.
- Remove exported `Connect`/`Disconnect` methods (now handled internally by the `ListenAndServe` lifecycle).

Signed-off-by: Shreyansh Sancheti <shsancheti@microsoft.com>
b5a35b7 to 30d3ca4
Fixes #2669
Problem
During live migration the vsock connection between the host and the GCS (Guest Compute Service) breaks when the UVM moves to the destination node. The bridge inside the GCS drops and cannot recover — `ListenAndServe` returns with an I/O error, and the GCS has no way to re-establish communication with the new host.

What this does
Wraps the bridge serve call in a reconnect loop in `cmd/gcs/main.go`. When the vsock connection drops, the GCS re-dials the host and calls `ListenAndServe` again on the same `Bridge`. `ListenAndServe` already creates fresh channels (`responseChan`, `quitChan`) on each call, so the `Bridge` can be reused across reconnections without resetting any state.

The `Host` (containers, processes, cgroups) persists across reconnections since it lives outside the `Bridge`.

A `Publisher` is added so that container wait goroutines — spawned during `CreateContainer` and blocked on `c.Wait()` — can route exit notifications through whichever bridge is currently active. During the reconnect gap the notification is dropped, which is safe because the host-side shim re-queries container state after reconnecting.

Design
No mutating RPCs (CreateContainer, ExecProcess, etc.) are in-flight when migration starts — the LM orchestrator ensures all container setup is complete before initiating migration. The only long-lived handler goroutine during migration is `waitOnProcessV2`, which is blocked on `select { case exitCode := <-exitCodeChan }` and doesn't touch `responseChan` until the process exits (through the `Publisher`). This means the `Bridge` can be safely reused across `ListenAndServe` calls without risk of handler goroutines racing on channel state.

During live migration the VM is frozen and only wakes up when the destination host shim is ready, so the vsock port should be immediately available. The reconnect loop uses a tight 100ms retry interval rather than exponential backoff.
The defer ordering in `ListenAndServe` is fixed so `quitChan` closes before `responseChan` becomes invalid, and `responseChan` is buffered to prevent `PublishNotification` from blocking on a dead bridge.

Changes
- `cmd/gcs/main.go`
- `internal/guest/bridge/bridge.go` — `Publisher` field, `ShutdownRequested()`, fixed defer ordering, buffered `responseChan`, priority select guard in `PublishNotification`
- `internal/guest/bridge/bridge_v2.go` — `Publisher.Publish()`
- `internal/guest/bridge/publisher.go`
- `internal/guest/bridge/publisher_test.go`

Testing
Tested on a two-node Hyper-V live migration setup using the `TwoNodeInfra` test module:

- `Invoke-FullLmTestCycle -Verbose` — deploys LM agents, creates a UVM with an LCOW container on Node_1, migrates to Node_2, verifies 100% completion on both nodes. Container `lcow-test` migrated with the pod sandbox intact.
- `crictl exec` — created an LCOW pod with our custom GCS (deployed via `rootfs.vhd`), started a container, exec'd `cat /tmp/test.txt` to verify bridge communication works after reconnect.
- `go build`, `go vet`, `gofmt` clean.