fix(cluster): declare openshell namespace via k3s auto-manifest #871
Open
latenighthackathon wants to merge 1 commit into NVIDIA:main
20584f9 to 5764450
reconcile_pki calls wait_for_namespace("openshell") with a ~115s budget
(60 attempts, 200ms→2s backoff) before the PKI phase can read or write
secrets. Today the namespace is created only by the k3s Helm controller
reconciling openshell-helmchart.yaml with createNamespace: true. On slow
networks, cold boots, or when the chart tarball download stalls, the
Helm controller can easily exceed that budget, producing:
Error: × K8s namespace not ready
╰─▶ timed out waiting for namespace 'openshell' to exist: Error from
server (NotFound): namespaces "openshell" not found
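
The ~115s figure can be reproduced with a minimal sketch, assuming the delay doubles from 200ms, is capped at 2s, and that a sleep follows each of the 60 attempts. The source states only "60 attempts, 200ms→2s backoff", so the doubling rule and the function names `backoff_delay` and `total_budget` are assumptions, not the actual `wait_for_namespace` implementation.

```rust
use std::time::Duration;

/// Delay after 0-based `attempt`: doubles from `initial`, capped at `max`.
/// (Hypothetical helper; the doubling schedule is an assumption.)
fn backoff_delay(attempt: u32, initial: Duration, max: Duration) -> Duration {
    // 2^attempt, saturating once the shift would overflow a u32.
    let factor = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
    initial.saturating_mul(factor).min(max)
}

/// Worst-case time spent sleeping across `attempts` polls.
fn total_budget(attempts: u32, initial: Duration, max: Duration) -> Duration {
    (0..attempts)
        .map(|attempt| backoff_delay(attempt, initial, max))
        .sum()
}
```

Under these assumptions the first four delays are 200ms, 400ms, 800ms, and 1600ms (3s in total) and the remaining 56 are capped at 2s (112s), which adds up to 115s.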
k3s auto-applies every YAML in /var/lib/rancher/k3s/server/manifests/
as soon as its API server is ready, before any Helm reconciliation.
A standalone Namespace manifest guarantees the namespace exists within
seconds of cluster startup, decoupled from Helm controller latency.
createNamespace: true on the HelmChart stays as an idempotent fallback
— Helm's --create-namespace coexists with pre-existing namespaces
without error.
Also updates the openshell-vm rootfs builder to include the new manifest
in its explicit copy list; the docker cluster-entrypoint picks it up
automatically via its *.yaml glob.
Docs:
- architecture/gateway-single-node.md lists the new manifest and
explains why it exists independently of the HelmChart CR.
Tests:
- Unit test in openshell-bootstrap embeds the manifest at compile time
via include_str! and asserts apiVersion/kind/metadata.name. include_str!
fails the build if the file is deleted or moved; the string checks
catch drift in the fields wait_for_namespace depends on.
- E2E test asserts `kubectl get namespace openshell` returns
`namespace/openshell` and that `status.phase == Active` against a
healthy gateway, rejecting a Terminating namespace or a transient
empty API response that would pass a bare existence check.
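
The unit test described above can be sketched roughly as follows. To keep the sketch self-contained, the manifest is inlined as a string constant rather than embedded with `include_str!`. Its body is an assumption reconstructed from the three fields the test is said to assert, and the SPDX header is omitted. The names `OPENSHELL_NAMESPACE_MANIFEST` and `manifest_declares_namespace` are hypothetical.

```rust
/// Stand-in for the bundled manifest. The real test would embed the file
/// with `include_str!`; this text is an assumed reconstruction.
const OPENSHELL_NAMESPACE_MANIFEST: &str = "\
apiVersion: v1
kind: Namespace
metadata:
  name: openshell
";

/// Returns true when the manifest contains the three fields the namespace
/// wait depends on. Plain line matching, no YAML parser (a simplification).
fn manifest_declares_namespace(manifest: &str, name: &str) -> bool {
    let has_line = |wanted: &str| manifest.lines().any(|line| line.trim() == wanted);
    has_line("apiVersion: v1")
        && has_line("kind: Namespace")
        && has_line(&format!("name: {}", name))
}
```

String checks like these are deliberately loose: they catch a renamed namespace or a changed `kind`, but they do not validate the YAML structure.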
Closes NVIDIA/NemoClaw#1974
Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com>
5764450 to f5118b6
Summary
`reconcile_pki` calls `wait_for_namespace("openshell")` with a ~115 s budget before the PKI phase can read or write secrets. Today the namespace is created only by the k3s Helm controller reconciling `openshell-helmchart.yaml` with `createNamespace: true`. On slow networks, cold boots, or stalled chart downloads the Helm controller can exceed that budget, causing the gateway to fail with the `K8s namespace not ready` timeout quoted in the commit message above.

Declaring the namespace as a standalone auto-applied manifest makes k3s create it within seconds of the API server becoming ready, decoupled from Helm controller latency.
Related Issue
Closes NVIDIA/NemoClaw#1974
Changes
- `deploy/kube/manifests/openshell-namespace.yaml`: a minimal `kind: Namespace` manifest with SPDX header. k3s auto-applies everything in `/var/lib/rancher/k3s/server/manifests/` on startup, before Helm reconciliation.
- `crates/openshell-vm/scripts/build-rootfs.sh`: updated to include the new file in its explicit manifest copy list. The docker path in `cluster-entrypoint.sh` uses a `*.yaml` glob and picks it up automatically.
- `crates/openshell-bootstrap/src/lib.rs`: embeds the manifest at compile time via `include_str!` and asserts `apiVersion`, `kind`, and `metadata.name`. The build fails if the file is deleted or renamed; the test fails if any of the three fields `wait_for_namespace` depends on drifts.
- `e2e/rust/tests/namespace_bootstrap.rs`: against a healthy gateway, asserts `kubectl get namespace openshell` returns `namespace/openshell` and that `status.phase == Active`. The phase check rejects a `Terminating` namespace from a tear-down or an empty response from a transient API error.
- `architecture/gateway-single-node.md`: lists the new manifest in the bundled-manifests section and explains why it exists independently of the HelmChart CR.

`createNamespace: true` on the HelmChart is retained as an idempotent fallback: Helm's `--create-namespace` coexists with pre-existing namespaces without error.

Testing
- `cargo test -p openshell-bootstrap --lib`: 109 passed / 0 failed, including the new `openshell_namespace_manifest_is_present_and_well_formed`
- `cargo check --tests --features e2e` (in `e2e/rust/`): new e2e suite compiles cleanly
- `rancher/k3s:v1.29.8-k3s1`: dropped the new manifest into `/var/lib/rancher/k3s/server/manifests/` on a running k3s container. k3s applied it within 6 ms (`ApplyingManifest` → `AppliedManifest` per the addon controller events), and `kubectl get namespace openshell -o name` returned `namespace/openshell` with `status.phase == Active`. This mirrors exactly what `cluster-entrypoint.sh`'s `cp "$manifest" "$K3S_MANIFESTS/"` step does, using a stock k3s image.
- `mise run license:check`: SPDX headers present on all new files
- `mise run helm:lint`: no regression on the openshell chart
- `mise run docs`: architecture + Fern docs validate
- `bash -n crates/openshell-vm/scripts/build-rootfs.sh`: syntax OK
- `apiVersion: v1` / `kind: Namespace` / `metadata.name: openshell`

Checklist
gateway-single-node.md)
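
The readiness check described for the E2E test (name plus `status.phase == Active`) can be sketched as below. This is an illustration, not the code in `e2e/rust/tests/namespace_bootstrap.rs`. The function names are hypothetical, and invoking a local `kubectl` binary directly is an assumption about how the test reaches the cluster.

```rust
use std::process::Command;

/// Pure check on kubectl output: `name_out` from `-o name`,
/// `phase_out` from `-o jsonpath={.status.phase}`.
fn namespace_is_active(ns: &str, name_out: &str, phase_out: &str) -> bool {
    // An empty response or a Terminating phase both fail here,
    // which a bare existence check would not catch.
    name_out.trim() == format!("namespace/{}", ns) && phase_out.trim() == "Active"
}

/// Runs kubectl and returns its stdout (assumes `kubectl` is on PATH).
fn kubectl_stdout(args: &[&str]) -> std::io::Result<String> {
    let output = Command::new("kubectl").args(args).output()?;
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}

/// Queries a live cluster for the namespace name and phase.
fn check_live_namespace(ns: &str) -> std::io::Result<bool> {
    let name = kubectl_stdout(&["get", "namespace", ns, "-o", "name"])?;
    let phase = kubectl_stdout(&["get", "namespace", ns, "-o", "jsonpath={.status.phase}"])?;
    Ok(namespace_is_active(ns, &name, &phase))
}
```

Keeping the comparison in a pure function separates the decision logic from the cluster call, so it can be exercised without a running gateway.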