bridge: add guest-side reconnect loop for live migration#2698

Merged
shreyanshjain7174 merged 2 commits into microsoft:main from shreyanshjain7174:bridge-reconnect-v2
Apr 28, 2026

Conversation

Contributor

@shreyanshjain7174 shreyanshjain7174 commented Apr 21, 2026

Fixes #2669

Problem

During live migration the vsock connection between the host and the GCS (Guest Compute Service) breaks when the UVM moves to the destination node. The bridge inside the GCS drops and cannot recover — ListenAndServe returns with an I/O error, and the GCS has no way to re-establish communication with the new host.

What this does

Wraps the bridge serve call in a reconnect loop in cmd/gcs/main.go. When the vsock connection drops, the GCS re-dials the host and calls ListenAndServe again on the same Bridge. ListenAndServe already creates fresh channels (responseChan, quitChan) on each call, so the Bridge can be reused across reconnections without resetting any state.

The Host (containers, processes, cgroups) persists across reconnections since it lives outside the Bridge.
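For concreteness, a minimal sketch of the loop shape this describes (dialHostVsock and serveForever are illustrative names, not the identifiers in cmd/gcs/main.go):

package main

import (
	"io"
	"log"
	"time"
)

// Bridge stands in for internal/guest/bridge.Bridge; ListenAndServe
// returns when the underlying vsock connection drops.
type Bridge interface {
	ListenAndServe(in io.Reader, out io.Writer) error
}

// dialHostVsock is a hypothetical helper that dials the host's vsock port.
func dialHostVsock() (io.ReadWriteCloser, error) { panic("illustrative only") }

// serveForever re-dials and re-serves the same Bridge after every
// connection loss. The Host state lives outside the Bridge, so nothing
// here needs to be rebuilt between iterations.
func serveForever(b Bridge) {
	for {
		conn, err := dialHostVsock()
		if err != nil {
			// The VM is frozen during migration and wakes when the
			// destination shim is ready, so a tight retry suffices.
			time.Sleep(100 * time.Millisecond)
			continue
		}
		// ListenAndServe creates fresh responseChan/quitChan per call,
		// so the Bridge is reusable across connections.
		if err := b.ListenAndServe(conn, conn); err != nil {
			log.Printf("bridge connection lost, reconnecting: %v", err)
		}
		_ = conn.Close()
	}
}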

A Publisher is added so that container wait goroutines — spawned during CreateContainer and blocked on c.Wait() — can route exit notifications through whichever bridge is currently active. During the reconnect gap the notification is dropped, which is safe because the host-side shim re-queries container state after reconnecting.
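A condensed sketch of the Publisher shape this implies; Publish is named in the changes table below, while the other names and types here are assumptions (and note that a later commit in this PR inlines this logic into the Bridge itself):

package bridge

import "sync"

// notification stands in for prot.ContainerNotification.
type notification struct{ /* exit code, container ID, ... */ }

// notifier is the subset of Bridge the Publisher needs (assumed shape).
type notifier interface {
	PublishNotification(n *notification)
}

// Publisher routes notifications to whichever bridge is currently
// active; a nil bridge means we are inside the reconnect gap.
type Publisher struct {
	mu     sync.Mutex
	bridge notifier
}

// SetBridge (hypothetical name) swaps the active bridge reference;
// pass nil while disconnected.
func (p *Publisher) SetBridge(b notifier) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.bridge = b
}

// Publish forwards n to the active bridge, or drops it when no bridge
// is connected. Dropping is safe because the host-side shim re-queries
// container state after reconnecting.
func (p *Publisher) Publish(n *notification) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.bridge == nil {
		return // dropped during the reconnect gap
	}
	p.bridge.PublishNotification(n)
}

The container wait goroutine then calls Publish after c.Wait() returns, instead of holding a direct reference to a possibly dead bridge.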

Design

No mutating RPCs (CreateContainer, ExecProcess, etc.) are in-flight when migration starts — the LM orchestrator ensures all container setup is complete before initiating migration. The only long-lived handler goroutine during migration is waitOnProcessV2, which is blocked on select { case exitCode := <-exitCodeChan } and doesn't touch responseChan until the process exits (through Publisher). This means the Bridge can be safely reused across ListenAndServe calls without risk of handler goroutines racing on channel state.

During live migration the VM is frozen and only wakes up when the destination host shim is ready, so the vsock port should be immediately available. The reconnect loop uses a tight 100ms retry interval rather than exponential backoff.

The defer ordering in ListenAndServe is fixed so quitChan closes before responseChan becomes invalid, and responseChan is buffered to prevent PublishNotification from blocking on a dead bridge.
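A condensed sketch of that teardown ordering and the guarded send; the buffer capacity and function bodies are assumptions:

package bridge

type bridgeResponse struct{ /* header + message bytes */ }

type Bridge struct {
	responseChan chan bridgeResponse
	quitChan     chan bool
}

func (b *Bridge) ListenAndServe( /* conn ... */ ) error {
	// Buffered so a publisher racing with teardown is not stuck on a
	// channel nobody drains anymore (capacity here is illustrative).
	b.responseChan = make(chan bridgeResponse, 16)
	b.quitChan = make(chan bool)

	// Defers run LIFO: quitChan (registered last) closes first, so
	// publishers observe shutdown before responseChan goes stale.
	defer close(b.responseChan)
	defer close(b.quitChan)

	// ... request/response serve loops ...
	return nil
}

func (b *Bridge) publish(resp bridgeResponse) {
	// Priority guard: check quitChan first so we never send on the
	// responseChan of a bridge that is already shutting down.
	select {
	case <-b.quitChan:
		return // bridge is dead; drop the notification
	default:
	}
	select {
	case b.responseChan <- resp:
	case <-b.quitChan:
	}
}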

Changes

| File | Change |
| --- | --- |
| cmd/gcs/main.go | Reconnect loop with 100ms retry; Bridge+Mux created once outside the loop |
| internal/guest/bridge/bridge.go | Publisher field, ShutdownRequested(), fixed defer ordering, buffered responseChan, priority select guard in PublishNotification |
| internal/guest/bridge/bridge_v2.go | Container wait goroutine uses Publisher.Publish() |
| internal/guest/bridge/publisher.go | Mutex-guarded bridge reference swap (40 lines) |
| internal/guest/bridge/publisher_test.go | Tests for nil-bridge drop and bridge-set-publish |

Testing

Tested on a two-node Hyper-V live migration setup using the TwoNodeInfra test module:

  • Invoke-FullLmTestCycle -Verbose — deploys LM agents, creates a UVM with an LCOW container on Node_1, migrates to Node_2, verifies 100% completion on both nodes. Container lcow-test migrated with pod sandbox intact.
  • Post-migration crictl exec — created an LCOW pod with our custom GCS (deployed via rootfs.vhd), started a container, exec'd cat /tmp/test.txt to verify bridge communication works after reconnect.
  • go build, go vet, gofmt clean.

@shreyanshjain7174 shreyanshjain7174 marked this pull request as ready for review April 21, 2026 17:28
@shreyanshjain7174 shreyanshjain7174 requested a review from a team as a code owner April 21, 2026 17:28
@shreyanshjain7174 shreyanshjain7174 force-pushed the bridge-reconnect-v2 branch 2 times, most recently from 05c7170 to 5fafdf4 on April 24, 2026 06:49
Contributor

@rawahars left a comment


Overall looking good.

Consider the suggestion below and see if it reduces the complexity and indirection.

We do not need a publisher indirection at all. The address of PublishNotification on the bridge will remain valid at all times simply because the bridge instance is never re-created; only connections are reset.

We can streamline this by adding a buffered channel (pendingMessages) directly to the bridge. PublishNotification simply pushes messages to this buffer and returns, naturally blocking to provide backpressure if the buffer hits capacity. We can then run a dedicated background goroutine with a select loop to process the flow: it reads from the pending buffer and forwards to responseChan. By leveraging the nil-channel pattern, we can safely disable the send case when responseChan is nil, while also listening on quitChan for graceful shutdown.

Something along the lines of:

type Bridge struct {
	// ... existing fields ...
	responseChan chan bridgeResponse
	quitChan     chan bool

	// Buffered channel that provides backpressure for publishers.
	pendingMessages chan bridgeResponse
}

func (b *Bridge) PublishNotification(n *prot.ContainerNotification) {
	// ... (ctx/span setup) ...
	resp := bridgeResponse{ /* ... */ }

	// Push to the buffer and return; this blocks only when the buffer
	// is full, which provides natural backpressure.
	b.pendingMessages <- resp
}

// processNotifications runs in the background to route messages from
// pendingMessages to responseChan.
func (b *Bridge) processNotifications() {
	for {
		// Re-evaluate each iteration so a responseChan swap is picked
		// up. Go completely ignores select cases with nil channels,
		// so while responseChan is nil we never pop items from
		// pendingMessages; they stay queued.
		var readChan chan bridgeResponse
		if b.responseChan != nil {
			readChan = b.pendingMessages
		}

		select {
		case resp := <-readChan:
			// Only reached when responseChan is non-nil.
			b.responseChan <- resp
		case <-b.quitChan:
			return
		}
	}
}

Comment thread: cmd/gcs/main.go
// re-dial. During live migration the VM is frozen and only wakes up when
// the destination host shim is ready, so the vsock port should be
// immediately available.
for {
Contributor


@jterry75 I wonder if we should retry indefinitely or enforce a limit, say 12000 iterations (20 minutes at the 100ms interval). In such a case, if the shim died but the VM is running, the VM can self-terminate after 20 minutes of inactivity.

While that is fragile, I feel that 20 minutes of inactivity after blackout is a critical failure path anyway.

What do you think?

Contributor Author


Good question for Justin. My instinct is to retry forever — if the shim died but the VM is still running, the GCS should keep trying until either the host comes back or the VM is torn down externally. A 20-minute timeout would leave the VM in a zombie state where it's running but can't be managed. That said, happy to add a configurable limit if Justin thinks there's a scenario where self-termination is better.

Contributor


No, it's not a zombie VM. IIRC if the GCS crashed, the VM will also exit, since its init process exited.

Contributor


The host controls the lifetime of these. Even for a zombie VM, the host would be responsible for cleaning up, so we should just loop forever. If there is no connection and nobody comes and terminates this VM, that's a different bug to deal with. But that should never happen, since we use terminate-on-last-handle-closed.

Comment thread: internal/guest/bridge/bridge.go (outdated)
}
b.responseChan <- resp
// Check quitChan first to avoid sending to a dead bridge.
select {
Contributor


This can silently drop in-flight messages on the old bridge when the transition to the new connection happens.

This section also makes me realize that we do not need an elaborate publisher indirection at all. The address of PublishNotification will remain valid at all times simply because the bridge instance is never re-created; only connections are reset.

You can just add a pending buffer to the bridge directly. PublishNotification then adds its message to the pending buffer and returns, blocking if the buffer is full. Another goroutine runs a select loop like this one: while responseChan is nil, messages stay queued in the pending buffer; otherwise it forwards them on responseChan. You can also have a quitChan case.

This might simplify your design significantly. Do add tests for the bridge covering this flow, though.

Commit message:
During live migration the vsock connection between the host and the GCS
breaks when the VM moves to the destination node. The GCS bridge drops
and cannot recover, leaving the guest unable to communicate with the
new host.

This adds a reconnect loop in cmd/gcs/main.go that re-dials the bridge
after a connection loss. On each iteration a fresh Bridge and Mux are
created while the Host state (containers, processes) persists across
reconnections.

A Publisher abstraction is added to bridge/publisher.go so that container
wait goroutines spawned during CreateContainer can route exit notifications
through the current bridge. When the bridge is down between reconnect
iterations, notifications are dropped with a warning — the host-side shim
re-queries container state after reconnecting.

The defer ordering in ListenAndServe is fixed so that quitChan closes
before responseChan becomes invalid, and responseChan is buffered to
prevent PublishNotification from panicking on a dead bridge.

Tested with Invoke-FullLmTestCycle on a two-node Hyper-V live migration
setup (Node_1 -> Node_2). Migration completes at 100% and container
exec works on the destination node after migration.

Signed-off-by: Shreyansh Sancheti <shsancheti@microsoft.com>
Commit message:
- Inline publisher fields (notifyMu, connected, pendingNotifications) and
  methods (publishNotification, drainPendingNotifications,
  disconnectNotifications) directly into the Bridge struct, removing the
  separate publisher.go file.
- Simplify PublishNotification to a direct channel send now that the
  publish/disconnect pattern ensures safe access.
- Add LinuxGcsVsockPort constant to internal/guest/prot for use by the
  GCS reconnect loop (cannot import from internal/gcs/prot due to
  windows build tag).
- Add 4 unit tests covering notification queuing, drain-on-reconnect,
  disconnect-after-drain, and full reconnect cycle.
- Remove exported Connect/Disconnect methods (now handled internally
  by ListenAndServe lifecycle).

Signed-off-by: Shreyansh Sancheti <shsancheti@microsoft.com>
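A minimal sketch of the inlined pattern this commit message describes; the field and method names follow the bullets above, while the bodies, types, and locking scheme are assumptions:

package bridge

import "sync"

type bridgeResponse struct{ /* header + message bytes */ }

type Bridge struct {
	responseChan chan bridgeResponse // buffered; created per connection

	notifyMu             sync.Mutex
	connected            bool
	pendingNotifications []bridgeResponse
}

// publishNotification sends directly while connected, otherwise queues.
func (b *Bridge) publishNotification(resp bridgeResponse) {
	b.notifyMu.Lock()
	defer b.notifyMu.Unlock()
	if !b.connected {
		b.pendingNotifications = append(b.pendingNotifications, resp)
		return
	}
	b.responseChan <- resp
}

// drainPendingNotifications flushes anything queued during the gap and
// flips back to connected mode; called once a new connection is serving.
func (b *Bridge) drainPendingNotifications() {
	b.notifyMu.Lock()
	defer b.notifyMu.Unlock()
	for _, resp := range b.pendingNotifications {
		b.responseChan <- resp
	}
	b.pendingNotifications = nil
	b.connected = true
}

// disconnectNotifications switches to queueing mode when the
// connection drops.
func (b *Bridge) disconnectNotifications() {
	b.notifyMu.Lock()
	defer b.notifyMu.Unlock()
	b.connected = false
}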
@shreyanshjain7174 shreyanshjain7174 merged commit a4051da into microsoft:main Apr 28, 2026
32 of 33 checks passed


Development

Successfully merging this pull request may close these issues.

Adds guest-side GCS changes for V2 shim support
