Place inner container cgroups under the pod's cgroup tree#169
Open
jamestoyer wants to merge 1 commit intocoder:mainfrom
Open
Place inner container cgroups under the pod's cgroup tree#169jamestoyer wants to merge 1 commit intocoder:mainfrom
jamestoyer wants to merge 1 commit intocoder:mainfrom
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #168
Summary
Wrap the dockerd invocation with
unshare --cgroup, a freshcgroup2remount, and the cgroup-delegation logic from moby'shack/dindscript. After the wrapper runs, dockerd's view of/sys/fs/cgroupis rooted at the envbox container's own cgroup, so all inner container cgroups are created as descendants of that scope on the host. Host-level cgroup-aware observability tools (Tetragon, Falco, custom eBPF agents) can then attribute processes inside inner containers to the parent pod.This is the approach Isovalent (Cilium/Tetragon vendor) recommended after testing it themselves — see moby/moby#45378 (comment).
Changes
cli/docker.gowrapDockerdCmd(dargs)helper that returns theunshare-prefixed command + args. The shell snippet does:umount /sys/fs/cgroup+mount -t cgroup2 cgroup /sys/fs/cgroup— fresh cgroup2 mount rooted at the new cgroup namespace's root (the envbox container's cgroup on the host)mkdir /sys/fs/cgroup/init+ retry loop that moves all processes viaxargs -rn1 < cgroup.procs > init/cgroup.procsand enables every available controller incgroup.subtree_control— needed because cgroupv2 forbids a cgroup from having both processes andsubtree_controlsetexecinto dockerd with the original argsdockerd(initial start and the twoIsNoSpaceErrrecovery paths) to go through the wrapper.xunix/sys.goreadCPUQuotaCGroupV2: if/sys/fs/cgroup/<self>/cpu.maxdoesn't exist (because the cgroup remount above has reparented the mount root), read/sys/fs/cgroup/cpu.maxdirectly. The mount root IS the current cgroup in that case so the values are equivalent.Why no
--mounton unshare?unshare --cgroupalone leaves the mount namespace shared with the parent envbox process, so theumount+mount -t cgroup2operations leak back to envbox. We tried isolating with--mount:--propagation private→ breaks sysbox-fs: its per-container mounts under/var/lib/sysboxfs/<id>/stop being visible tosysbox-runcin the dockerd namespace and inner-container creation fails.--propagation slave→ same failure: the parent mount points areprivateby default, so nothing propagates to the slave child.Making
/var/lib/sysboxfs/(or/)rsharedin envbox's mount namespace before the unshare would work but is more invasive. The pragmatic compromise: accept the mount-namespace leak and handle the one observable side effect (the CPU quota read) via thexunix/sys.gofallback. CPU enforcement is unaffected — the inner workspace container's cgroup is now a descendant of the pod's cgroup tree, so the pod's resource limits apply transitively.Verification
Tested against:
--enable-cgtrackerid=trueEnd-to-end check: after a workspace start, running
whoamifrom a Coder terminal (inside the inner workspace container) now produces a Tetragonprocess_execevent with a fullpodfield — namespace, name, UID, container ID, image, security context, and pod labels. Before this change the same event had nopodfield at all.Workspace startup is otherwise unchanged. Sysbox-fs continues to provide its
/sysand/procvirtualization correctly. The "no internal processes" cgroupv2 rule that previously blocked nested container creation under a privileged pod's.scopecgroup is now satisfied because all envbox/sysbox processes are first migrated into the/initsibling cgroup.Backwards compatibility
The wrapper is a no-op for cgroup configurations the delegation block doesn't apply to:
if [ -f /sys/fs/cgroup/cgroup.controllers ]guard skips the delegation block; only the umount/mount happens.unshare --cgroup(< 4.6, very rare today): would surface as a startup error.For cgroup v2 hosts the only behavioural change is the cgroup placement of inner containers — which is what we want.
References
hack/dindscript (cgroup delegation logic borrowed from there)cgtracker(consumes the new hierarchy on the agent side)