Place inner container cgroups under the pod's cgroup tree by jamestoyer · Pull Request #169 · coder/envbox

jamestoyer · 2026-04-27T12:03:02Z

Closes #168

Summary

Wrap the dockerd invocation with unshare --cgroup, a fresh cgroup2 remount, and the cgroup-delegation logic from moby's hack/dind script. After the wrapper runs, dockerd's view of /sys/fs/cgroup is rooted at the envbox container's own cgroup, so all inner container cgroups are created as descendants of that scope on the host. Host-level cgroup-aware observability tools (Tetragon, Falco, custom eBPF agents) can then attribute processes inside inner containers to the parent pod.

This is the approach Isovalent (Cilium/Tetragon vendor) recommended after testing it themselves — see moby/moby#45378 (comment).

Changes

cli/docker.go

Adds wrapDockerdCmd(dargs) helper that returns the unshare-prefixed command + args. The shell snippet does:
- umount /sys/fs/cgroup + mount -t cgroup2 cgroup /sys/fs/cgroup — fresh cgroup2 mount rooted at the new cgroup namespace's root (the envbox container's cgroup on the host)
- mkdir /sys/fs/cgroup/init + retry loop that moves all processes via xargs -rn1 < cgroup.procs > init/cgroup.procs and enables every available controller in cgroup.subtree_control — needed because cgroupv2 forbids a cgroup from having both processes and subtree_control set
- exec into dockerd with the original args
Updates the three sites that launch/restart dockerd (initial start and the two IsNoSpaceErr recovery paths) to go through the wrapper.

xunix/sys.go

Adds a fallback in readCPUQuotaCGroupV2: if /sys/fs/cgroup/<self>/cpu.max doesn't exist (because the cgroup remount above has reparented the mount root), read /sys/fs/cgroup/cpu.max directly. The mount root IS the current cgroup in that case so the values are equivalent.

Why no `--mount` on unshare?

unshare --cgroup alone leaves the mount namespace shared with the parent envbox process, so the umount + mount -t cgroup2 operations leak back to envbox. We tried isolating with --mount:

--propagation private → breaks sysbox-fs: its per-container mounts under /var/lib/sysboxfs/<id>/ stop being visible to sysbox-runc in the dockerd namespace and inner-container creation fails.
--propagation slave → same failure: the parent mount points are private by default, so nothing propagates to the slave child.

Making /var/lib/sysboxfs/ (or /) rshared in envbox's mount namespace before the unshare would work but is more invasive. The pragmatic compromise: accept the mount-namespace leak and handle the one observable side effect (the CPU quota read) via the xunix/sys.go fallback. CPU enforcement is unaffected — the inner workspace container's cgroup is now a descendant of the pod's cgroup tree, so the pod's resource limits apply transitively.

Verification

Tested against:

containerd 2.2.1
Linux 6.8 / Ubuntu 22.04 with cgroup v2
Tetragon v1.18.1 with --enable-cgtrackerid=true

End-to-end check: after a workspace start, running whoami from a Coder terminal (inside the inner workspace container) now produces a Tetragon process_exec event with a full pod field — namespace, name, UID, container ID, image, security context, and pod labels. Before this change the same event had no pod field at all.

Workspace startup is otherwise unchanged. Sysbox-fs continues to provide its /sys and /proc virtualization correctly. The "no internal processes" cgroupv2 rule that previously blocked nested container creation under a privileged pod's .scope cgroup is now satisfied because all envbox/sysbox processes are first migrated into the /init sibling cgroup.

Backwards compatibility

The wrapper is a no-op for cgroup configurations the delegation block doesn't apply to:

cgroup v1 hosts: the if [ -f /sys/fs/cgroup/cgroup.controllers ] guard skips the delegation block; only the umount/mount happens.
Kernels without unshare --cgroup (< 4.6, very rare today): would surface as a startup error.

For cgroup v2 hosts the only behavioural change is the cgroup placement of inner containers — which is what we want.

References

DinD cgroupv2 problem inside K8s moby/moby#45378 (and Isovalent's comment that inspired this)
moby's hack/dind script (cgroup delegation logic borrowed from there)
Cilium Tetragon's cgtracker (consumes the new hierarchy on the agent side)

Place inner container cgroups under the pod's cgroup tree

c5aa82e

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Place inner container cgroups under the pod's cgroup tree#169

Place inner container cgroups under the pod's cgroup tree#169
jamestoyer wants to merge 1 commit intocoder:mainfrom
jamestoyer:feat/dind-cgroup-attribution

jamestoyer commented Apr 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jamestoyer commented Apr 27, 2026

Summary

Changes

Why no --mount on unshare?

Verification

Backwards compatibility

References

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Why no `--mount` on unshare?