
fix(dns): preserve NRPT rules on startup and improve hosts file retry#114

Draft
f0ssel wants to merge 1 commit into main from fix/plat-110-dns-resilience

f0ssel commented Apr 10, 2026

Two changes to improve DNS resilience on Windows when GlobalProtect or other VPNs cause engine restarts.

Problem

When the Coder Desktop tunnel binary restarts (e.g., after a GlobalProtect VPN reconnect), the Tailscale engine is recreated. During this process:

  1. newNRPTRuleDatabase() called DelAllRuleKeys() unconditionally, deleting all .coder NRPT routing rules from the Windows registry even though they were valid and working.
  2. The first dns.Set() call often fails because the hosts file is locked by endpoint security (GlobalProtect, CrowdStrike, Defender), and the retry window was only 50ms (5×10ms).
  3. If the Coder API is also unreachable (because it is routed through the reconnecting VPN), no workspace snapshot arrives to reprogram DNS, leaving resolution broken indefinitely.

Fix

1. Preserve NRPT rules on startup (nrpt_windows.go)

  • newNRPTRuleDatabase() no longer calls DelAllRuleKeys()
  • Existing rule IDs are tracked in a new staleRuleIDs field
  • Stale rules are cleaned up in the first WriteSplitDNSConfig() call, after replacement rules are confirmed written
  • DelAllRuleKeys() also cleans up stale rules if called directly (e.g., during teardown)
  • This preserves .coder DNS routing during the gap between engine creation and first successful configuration
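
The ordering described above (track existing rules as stale, write replacements, then delete the stale ones) can be illustrated with a minimal sketch. This is not the code in nrpt_windows.go: `fakeRegistry`, `ruleDB`, `ruleID`, and the lowercase method names are hypothetical stand-ins, and an in-memory map replaces the Windows registry so the example runs anywhere.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeRegistry is a hypothetical in-memory stand-in for the NRPT registry keys.
type fakeRegistry struct {
	rules      map[string][]string // rule ID -> DNS suffixes
	failWrites bool                // simulates a failed registry write
}

func (r *fakeRegistry) write(id string, suffixes []string) error {
	if r.failWrites {
		return errors.New("registry write failed")
	}
	r.rules[id] = suffixes
	return nil
}

func (r *fakeRegistry) remove(id string) { delete(r.rules, id) }

// ruleDB is an illustrative model of the rule database, not the real type.
type ruleDB struct {
	reg          *fakeRegistry
	ruleID       string   // ID this instance writes (assumed fixed for the sketch)
	staleRuleIDs []string // rules found at startup and not yet replaced
}

// newRuleDB records pre-existing rules as stale instead of deleting them.
func newRuleDB(reg *fakeRegistry, ruleID string) *ruleDB {
	db := &ruleDB{reg: reg, ruleID: ruleID}
	for id := range reg.rules {
		if id != ruleID {
			db.staleRuleIDs = append(db.staleRuleIDs, id)
		}
	}
	return db
}

func (db *ruleDB) removeStale() {
	for _, id := range db.staleRuleIDs {
		db.reg.remove(id)
	}
	db.staleRuleIDs = nil
}

// writeSplitDNSConfig removes stale rules only after the replacement is written.
func (db *ruleDB) writeSplitDNSConfig(suffixes []string) error {
	if err := db.reg.write(db.ruleID, suffixes); err != nil {
		return err // stale rules stay in place, so routing keeps working
	}
	db.removeStale()
	return nil
}

// delAllRuleKeys removes both stale rules and the rule this instance wrote.
func (db *ruleDB) delAllRuleKeys() {
	db.removeStale()
	db.reg.remove(db.ruleID)
}

func main() {
	reg := &fakeRegistry{rules: map[string][]string{"old-rule": {".coder"}}, failWrites: true}
	db := newRuleDB(reg, "new-rule")
	fmt.Println("first write:", db.writeSplitDNSConfig([]string{".coder"}), "rules:", len(reg.rules))
	reg.failWrites = false
	fmt.Println("second write:", db.writeSplitDNSConfig([]string{".coder"}), "rules:", len(reg.rules))
}
```

In this model a failed first write leaves the old rule untouched, which is the property the change is after: .coder resolution keeps working until a replacement actually exists.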

2. Exponential backoff for hosts file retry (manager_windows.go)

  • Initial backoff: 50ms, multiplier: 2x, max interval: 5s, deadline: 30s
  • Each retry attempt is logged at debug level with attempt number and error
  • Final error includes total attempt count for diagnostics

Fixes https://linear.app/codercom/issue/PLAT-110
Companion to coder/coder#24253

🤖 Generated by Coder Agents

Two changes to improve DNS resilience on Windows when GlobalProtect
or other VPNs cause engine restarts:

1. Don't delete NRPT rules in nrptRuleDatabase constructor.
   Previously, newNRPTRuleDatabase() called DelAllRuleKeys()
   unconditionally, wiping all .coder DNS routing rules on engine
   startup. Now, existing rules are tracked as 'stale' and only
   cleaned up after replacement rules are confirmed written in
   WriteSplitDNSConfig(). This preserves DNS routing during the
   gap between engine creation and first successful configuration.

2. Replace hosts file retry (5x10ms) with exponential backoff.
   Endpoint security tools (GlobalProtect, CrowdStrike, Windows
   Defender) can hold the hosts file for seconds. The new retry
   uses 50ms initial backoff, 2x multiplier, 5s max, 30s deadline,
   with debug logging on each attempt.