bridge: add guest-side reconnect loop for live migration#2698

Merged
shreyanshjain7174 merged 2 commits into microsoft:main from shreyanshjain7174:bridge-reconnect-v2
Apr 28, 2026

Conversation

Contributor

@shreyanshjain7174 shreyanshjain7174 commented Apr 21, 2026

Fixes #2669

Problem

During live migration the vsock connection between the host and the GCS (Guest Compute Service) breaks when the UVM moves to the destination node. The bridge inside the GCS drops and cannot recover — ListenAndServe returns with an I/O error, and the GCS has no way to re-establish communication with the new host.

What this does

Wraps the bridge serve call in a reconnect loop in cmd/gcs/main.go. When the vsock connection drops, the GCS re-dials the host and calls ListenAndServe again on the same Bridge. ListenAndServe already creates fresh channels (responseChan, quitChan) on each call, so the Bridge can be reused across reconnections without resetting any state.

The Host (containers, processes, cgroups) persists across reconnections since it lives outside the Bridge.
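For concreteness, a minimal sketch of the loop shape this describes (dialHostVsock and serveForever are illustrative names, not the identifiers in cmd/gcs/main.go):

package main

import (
	"io"
	"log"
	"time"
)

// Bridge stands in for internal/guest/bridge.Bridge; ListenAndServe
// returns when the underlying vsock connection drops.
type Bridge interface {
	ListenAndServe(in io.Reader, out io.Writer) error
}

// dialHostVsock is a hypothetical helper that dials the host's vsock port.
func dialHostVsock() (io.ReadWriteCloser, error) { panic("illustrative only") }

// serveForever re-dials and re-serves the same Bridge after every
// connection loss. The Host state lives outside the Bridge, so nothing
// here needs to be rebuilt between iterations.
func serveForever(b Bridge) {
	for {
		conn, err := dialHostVsock()
		if err != nil {
			// The VM is frozen during migration and wakes when the
			// destination shim is ready, so a tight retry suffices.
			time.Sleep(100 * time.Millisecond)
			continue
		}
		// ListenAndServe creates fresh responseChan/quitChan per call,
		// so the Bridge is reusable across connections.
		if err := b.ListenAndServe(conn, conn); err != nil {
			log.Printf("bridge connection lost, reconnecting: %v", err)
		}
		_ = conn.Close()
	}
}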

A Publisher is added so that container wait goroutines — spawned during CreateContainer and blocked on c.Wait() — can route exit notifications through whichever bridge is currently active. During the reconnect gap the notification is dropped, which is safe because the host-side shim re-queries container state after reconnecting.
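A condensed sketch of the Publisher shape this implies; Publish is named in the changes table below, while the other names and types here are assumptions (and note that a later commit in this PR inlines this logic into the Bridge itself):

package bridge

import "sync"

// notification stands in for prot.ContainerNotification.
type notification struct{ /* exit code, container ID, ... */ }

// notifier is the subset of Bridge the Publisher needs (assumed shape).
type notifier interface {
	PublishNotification(n *notification)
}

// Publisher routes notifications to whichever bridge is currently
// active; a nil bridge means we are inside the reconnect gap.
type Publisher struct {
	mu     sync.Mutex
	bridge notifier
}

// SetBridge (hypothetical name) swaps the active bridge reference;
// pass nil while disconnected.
func (p *Publisher) SetBridge(b notifier) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.bridge = b
}

// Publish forwards n to the active bridge, or drops it when no bridge
// is connected. Dropping is safe because the host-side shim re-queries
// container state after reconnecting.
func (p *Publisher) Publish(n *notification) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.bridge == nil {
		return // dropped during the reconnect gap
	}
	p.bridge.PublishNotification(n)
}

The container wait goroutine then calls Publish after c.Wait() returns, instead of holding a direct reference to a possibly dead bridge.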

Design

No mutating RPCs (CreateContainer, ExecProcess, etc.) are in-flight when migration starts — the LM orchestrator ensures all container setup is complete before initiating migration. The only long-lived handler goroutine during migration is waitOnProcessV2, which is blocked on select { case exitCode := <-exitCodeChan } and doesn't touch responseChan until the process exits (through Publisher). This means the Bridge can be safely reused across ListenAndServe calls without risk of handler goroutines racing on channel state.

During live migration the VM is frozen and only wakes up when the destination host shim is ready, so the vsock port should be immediately available. The reconnect loop uses a tight 100ms retry interval rather than exponential backoff.

The defer ordering in ListenAndServe is fixed so quitChan closes before responseChan becomes invalid, and responseChan is buffered to prevent PublishNotification from blocking on a dead bridge.
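A condensed sketch of that teardown ordering and the guarded send; the buffer capacity and function bodies are assumptions:

package bridge

type bridgeResponse struct{ /* header + message bytes */ }

type Bridge struct {
	responseChan chan bridgeResponse
	quitChan     chan bool
}

func (b *Bridge) ListenAndServe( /* conn ... */ ) error {
	// Buffered so a publisher racing with teardown is not stuck on a
	// channel nobody drains anymore (capacity here is illustrative).
	b.responseChan = make(chan bridgeResponse, 16)
	b.quitChan = make(chan bool)

	// Defers run LIFO: quitChan (registered last) closes first, so
	// publishers observe shutdown before responseChan goes stale.
	defer close(b.responseChan)
	defer close(b.quitChan)

	// ... request/response serve loops ...
	return nil
}

func (b *Bridge) publish(resp bridgeResponse) {
	// Priority guard: check quitChan first so we never send on the
	// responseChan of a bridge that is already shutting down.
	select {
	case <-b.quitChan:
		return // bridge is dead; drop the notification
	default:
	}
	select {
	case b.responseChan <- resp:
	case <-b.quitChan:
	}
}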

Changes

| File | Change |
| --- | --- |
| cmd/gcs/main.go | Reconnect loop with 100ms retry; Bridge+Mux created once outside the loop |
| internal/guest/bridge/bridge.go | Publisher field, ShutdownRequested(), fixed defer ordering, buffered responseChan, priority select guard in PublishNotification |
| internal/guest/bridge/bridge_v2.go | Container wait goroutine uses Publisher.Publish() |
| internal/guest/bridge/publisher.go | Mutex-guarded bridge reference swap (40 lines) |
| internal/guest/bridge/publisher_test.go | Tests for nil-bridge drop and bridge-set-publish |

Testing

Tested on a two-node Hyper-V live migration setup using the TwoNodeInfra test module:

  • Invoke-FullLmTestCycle -Verbose — deploys LM agents, creates a UVM with an LCOW container on Node_1, migrates to Node_2, verifies 100% completion on both nodes. Container lcow-test migrated with pod sandbox intact.
  • Post-migration crictl exec — created an LCOW pod with our custom GCS (deployed via rootfs.vhd), started a container, exec'd cat /tmp/test.txt to verify bridge communication works after reconnect.
  • go build, go vet, gofmt clean.

@shreyanshjain7174 shreyanshjain7174 marked this pull request as ready for review April 21, 2026 17:28
@shreyanshjain7174 shreyanshjain7174 requested a review from a team as a code owner April 21, 2026 17:28
@shreyanshjain7174 shreyanshjain7174 force-pushed the bridge-reconnect-v2 branch 2 times, most recently from 05c7170 to 5fafdf4 on April 24, 2026 06:49
Contributor

@rawahars left a comment


Overall looking good.

Consider the suggestion below and see if it reduces the complexity and indirection.

We do not need a publisher indirection at all. The address of PublishNotification on the bridge will remain valid at all times simply because the bridge instance is never re-created; only connections are reset.

We can streamline this by adding a buffered channel (pendingMessages) directly to the bridge. PublishNotification simply pushes messages to this buffer and returns, naturally blocking to provide backpressure if the buffer hits capacity. We can then run a dedicated background goroutine with a select loop to process the flow: it reads from the pending buffer and forwards to responseChan. By leveraging the nil-channel pattern, we can safely disable the send case when responseChan is nil, while also listening on quitChan for graceful shutdown.

Something along the lines of:

type Bridge struct {
	// ... existing fields ...
	responseChan chan bridgeResponse
	quitChan     chan bool

	// Buffered channel that provides backpressure for publishers.
	pendingMessages chan bridgeResponse
}

func (b *Bridge) PublishNotification(n *prot.ContainerNotification) {
	// ... (ctx/span setup) ...
	resp := bridgeResponse{ /* ... */ }

	// Push to the buffer and return; this blocks only when the buffer
	// is full, which provides natural backpressure.
	b.pendingMessages <- resp
}

// processNotifications runs in the background to route messages from
// pendingMessages to responseChan.
func (b *Bridge) processNotifications() {
	for {
		// Re-evaluate each iteration so a responseChan swap is picked
		// up. Go completely ignores select cases with nil channels,
		// so while responseChan is nil we never pop items from
		// pendingMessages; they stay queued.
		var readChan chan bridgeResponse
		if b.responseChan != nil {
			readChan = b.pendingMessages
		}

		select {
		case resp := <-readChan:
			// Only reached when responseChan is non-nil.
			b.responseChan <- resp
		case <-b.quitChan:
			return
		}
	}
}

Comment thread: cmd/gcs/main.go
// re-dial. During live migration the VM is frozen and only wakes up when
// the destination host shim is ready, so the vsock port should be
// immediately available.
for {
Contributor


@jterry75 I wonder if we should retry indefinitely or enforce a limit, say 12000 iterations (20 minutes at the 100ms interval). In such a case, if the shim died but the VM is running, the VM can self-terminate after 20 minutes of inactivity.

While that is fragile, I feel that 20 minutes of inactivity after blackout is a critical failure path anyway.

What do you think?

Contributor Author


Good question for Justin. My instinct is to retry forever — if the shim died but the VM is still running, the GCS should keep trying until either the host comes back or the VM is torn down externally. A 20-minute timeout would leave the VM in a zombie state where it's running but can't be managed. That said, happy to add a configurable limit if Justin thinks there's a scenario where self-termination is better.

Contributor


No, it's not a zombie VM. IIRC if the GCS crashed, the VM will also exit, since its init process exited.

Contributor


The host controls the lifetime of these. Even for a zombie VM, the host would be responsible for cleaning up, so we should just loop forever. If there is no connection and nobody comes and terminates this VM, that's a different bug to deal with. But that should never happen, since we use terminate-on-last-handle-closed.

Comment thread: internal/guest/bridge/bridge.go (outdated)
}
b.responseChan <- resp
// Check quitChan first to avoid sending to a dead bridge.
select {
Contributor


This can silently drop in-flight messages on the old bridge when the transition to the new connection happens.

This section also makes me realize that we do not need an elaborate publisher indirection at all. The address of PublishNotification will remain valid at all times simply because the bridge instance is never re-created; only connections are reset.

You can just add a pending buffer to the bridge directly. PublishNotification then adds its message to the pending buffer and returns, blocking if the buffer is full. Another goroutine runs a select loop like this one: while responseChan is nil, messages stay queued in the pending buffer; otherwise it forwards them on responseChan. You can also have a quitChan case.

This might simplify your design significantly. Do add tests for the bridge covering this flow, though.

Commit message:
During live migration the vsock connection between the host and the GCS
breaks when the VM moves to the destination node. The GCS bridge drops
and cannot recover, leaving the guest unable to communicate with the
new host.

This adds a reconnect loop in cmd/gcs/main.go that re-dials the bridge
after a connection loss. On each iteration a fresh Bridge and Mux are
created while the Host state (containers, processes) persists across
reconnections.

A Publisher abstraction is added to bridge/publisher.go so that container
wait goroutines spawned during CreateContainer can route exit notifications
through the current bridge. When the bridge is down between reconnect
iterations, notifications are dropped with a warning — the host-side shim
re-queries container state after reconnecting.

The defer ordering in ListenAndServe is fixed so that quitChan closes
before responseChan becomes invalid, and responseChan is buffered to
prevent PublishNotification from panicking on a dead bridge.

Tested with Invoke-FullLmTestCycle on a two-node Hyper-V live migration
setup (Node_1 -> Node_2). Migration completes at 100% and container
exec works on the destination node after migration.

Signed-off-by: Shreyansh Sancheti <shsancheti@microsoft.com>
Commit message:
- Inline publisher fields (notifyMu, connected, pendingNotifications) and
  methods (publishNotification, drainPendingNotifications,
  disconnectNotifications) directly into the Bridge struct, removing the
  separate publisher.go file.
- Simplify PublishNotification to a direct channel send now that the
  publish/disconnect pattern ensures safe access.
- Add LinuxGcsVsockPort constant to internal/guest/prot for use by the
  GCS reconnect loop (cannot import from internal/gcs/prot due to
  windows build tag).
- Add 4 unit tests covering notification queuing, drain-on-reconnect,
  disconnect-after-drain, and full reconnect cycle.
- Remove exported Connect/Disconnect methods (now handled internally
  by ListenAndServe lifecycle).

Signed-off-by: Shreyansh Sancheti <shsancheti@microsoft.com>
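A minimal sketch of the inlined pattern this commit message describes; the field and method names follow the bullets above, while the bodies, types, and locking scheme are assumptions:

package bridge

import "sync"

type bridgeResponse struct{ /* header + message bytes */ }

type Bridge struct {
	responseChan chan bridgeResponse // buffered; created per connection

	notifyMu             sync.Mutex
	connected            bool
	pendingNotifications []bridgeResponse
}

// publishNotification sends directly while connected, otherwise queues.
func (b *Bridge) publishNotification(resp bridgeResponse) {
	b.notifyMu.Lock()
	defer b.notifyMu.Unlock()
	if !b.connected {
		b.pendingNotifications = append(b.pendingNotifications, resp)
		return
	}
	b.responseChan <- resp
}

// drainPendingNotifications flushes anything queued during the gap and
// flips back to connected mode; called once a new connection is serving.
func (b *Bridge) drainPendingNotifications() {
	b.notifyMu.Lock()
	defer b.notifyMu.Unlock()
	for _, resp := range b.pendingNotifications {
		b.responseChan <- resp
	}
	b.pendingNotifications = nil
	b.connected = true
}

// disconnectNotifications switches to queueing mode when the
// connection drops.
func (b *Bridge) disconnectNotifications() {
	b.notifyMu.Lock()
	defer b.notifyMu.Unlock()
	b.connected = false
}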
@shreyanshjain7174 shreyanshjain7174 merged commit a4051da into microsoft:main Apr 28, 2026
32 of 33 checks passed


Development

Successfully merging this pull request may close these issues.

Adds guest-side GCS changes for V2 shim support
