Container fundamentals: namespaces, cgroups, OCI spec
"A container is a process with a funny view of the filesystem and a budget"
Chapter 101 produced the artifact — a hermetic, statically linked binary ready to ship. This chapter explains what the container that wraps that binary actually is. Not the marketing version (“lightweight VM”). The real version: a Linux process started with a few kernel flags set and some configuration files describing how to start it. Every operational question in the next chapters (OCI lifecycle in 104, GitOps in 105, autoscaling in 49, Kubernetes at large) is grounded in the primitives here.
By the end of this chapter the reader can explain what a namespace is, what a cgroup is, the difference between cgroups v1 and v2, what the OCI spec standardizes, why distroless and Wolfi and Chainguard images matter, how multi-stage Dockerfiles work, and what a container runtime actually does. The knowledge threshold is “could debug a broken container from first principles, with nothing but strace and /proc.”
Outline:
- A container is not a VM.
- Linux namespaces — seven isolation axes.
- cgroups — the budget layer.
- cgroups v1 vs v2 and why it matters.
- The OCI image spec and the runtime spec.
- Distroless, scratch, Wolfi, Chainguard.
- Multi-stage Dockerfiles.
- The container runtime layer — containerd, CRI-O, runc.
- Capabilities, seccomp, and the security surface.
- The mental model.
102.1 A container is not a VM
The biggest conceptual mistake engineers make with containers is treating them as lightweight VMs. They are not. A VM has its own kernel, its own virtual hardware (disks, NICs, CPUs), and communicates with the host only through the hypervisor. A container shares the host kernel. Every syscall a container makes is handled by the same Linux kernel running every other container on the box. There is no virtualization of the CPU, the memory, the network stack, or the filesystem — only the view of them that a process sees.
Concretely: start a container. It runs a process. That process is visible to the host with ps aux | grep <pid>. It is not emulated. It is a normal Linux process whose clone() syscall at startup passed flags like CLONE_NEWPID, CLONE_NEWNS, CLONE_NEWNET. Those flags tell the kernel “give this process its own view of the PID tree, its own view of the mount table, its own network stack.” The kernel does the bookkeeping. The process does not know it is containerized; it thinks it is alone on the machine. But underneath it is just a process.
This has real consequences. A container cannot have a different kernel than the host — uname -r inside any container returns the host’s kernel version. A container can crash the host with a kernel bug because they share a kernel. A container inherits the host’s scheduler and memory allocator. The isolation is cooperative; it is the kernel enforcing rules on itself, not a hardware boundary.
The benefits of this model are massive. No hypervisor overhead. Start time measured in milliseconds, not seconds. A container’s memory footprint is its process’s memory footprint plus a tiny bookkeeping overhead, not a full kernel. You can run hundreds of containers on a box where you could run a dozen VMs. The price is weaker isolation. For multi-tenant untrusted workloads (cloud-hosted customer code), the industry uses “sandboxed containers” like gVisor or Firecracker that add a layer of hardware isolation on top, because the kernel-sharing model is not enough.
102.2 Linux namespaces — seven isolation axes
Namespaces are the “different view” part of containerization. Linux has seven user-space-visible namespaces that containers routinely use (kernel 5.6 added an eighth, the time namespace, which container runtimes rarely touch), and a container uses most of them. Each namespace isolates one class of system resource:
- PID namespace (CLONE_NEWPID). The process tree. Inside a new PID namespace, the first process gets PID 1. It cannot see processes outside the namespace. kill -9 1 from inside a container does not kill the host’s init.
- Mount namespace (CLONE_NEWNS). The mount table. The container sees only the mounts the runtime set up — typically an overlay filesystem rooted at the image, plus a few bind mounts from the host (e.g., /etc/hosts, /etc/resolv.conf).
- Network namespace (CLONE_NEWNET). The network stack. The container has its own interfaces, routing table, iptables rules, and socket tables. Connecting the container to the outside world means creating a veth pair, one end in the host namespace and one end in the container’s, and wiring it through a bridge (this is how Docker’s default bridge network works).
- UTS namespace (CLONE_NEWUTS). Hostname and domain name. The container can run hostname without affecting the host.
- IPC namespace (CLONE_NEWIPC). SysV IPC, POSIX message queues, shared memory segments. The container cannot see the host’s or other containers’ IPC.
- User namespace (CLONE_NEWUSER). User and group IDs. This is the most complex namespace. It maps UIDs inside the container to UIDs outside. root inside the container (UID 0) can be mapped to an unprivileged UID (e.g., 100000) on the host. This is how rootless containers work.
- Cgroup namespace (CLONE_NEWCGROUP). The view of the cgroup hierarchy. Less commonly discussed; it prevents a container from seeing the full cgroup tree of the host.
A container is created by calling clone() with all the relevant CLONE_NEW* flags at once. The new process runs in a completely different world from its perspective. To see a real container’s namespaces from the host:
ls -l /proc/<pid>/ns/
# lrwxrwxrwx 1 root root 0 Apr 10 10:00 cgroup -> 'cgroup:[4026532987]'
# lrwxrwxrwx 1 root root 0 Apr 10 10:00 ipc -> 'ipc:[4026532983]'
# lrwxrwxrwx 1 root root 0 Apr 10 10:00 mnt -> 'mnt:[4026532981]'
# lrwxrwxrwx 1 root root 0 Apr 10 10:00 net -> 'net:[4026532985]'
# lrwxrwxrwx 1 root root 0 Apr 10 10:00 pid -> 'pid:[4026532984]'
# lrwxrwxrwx 1 root root 0 Apr 10 10:00 uts -> 'uts:[4026532982]'
Each number is a namespace ID. Two processes with the same ID share that namespace. Two with different IDs are isolated. The nsenter command lets you enter an existing namespace; unshare creates a new one. These are the primitives under docker exec.
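The namespace links above can be read programmatically: comparing two processes’ link targets tells you whether they share a namespace. A minimal Go sketch (Linux-only; reading your own process’s links needs no root; function names are mine, not a standard API):

```go
package main

import (
	"fmt"
	"os"
)

// nsID returns the namespace identifier (e.g. "pid:[4026531836]")
// for the given pid ("self" also works) and namespace kind.
func nsID(pid, kind string) (string, error) {
	return os.Readlink(fmt.Sprintf("/proc/%s/ns/%s", pid, kind))
}

// sameNamespace reports whether two processes share a namespace:
// identical link targets mean the same kernel namespace object.
func sameNamespace(pidA, pidB, kind string) (bool, error) {
	a, err := nsID(pidA, kind)
	if err != nil {
		return false, err
	}
	b, err := nsID(pidB, kind)
	if err != nil {
		return false, err
	}
	return a == b, nil
}

func main() {
	// Print this process's namespace IDs, like ls -l /proc/self/ns/.
	for _, kind := range []string{"pid", "mnt", "net", "uts", "ipc", "cgroup"} {
		id, err := nsID("self", kind)
		if err != nil {
			fmt.Printf("%-7s (unavailable: %v)\n", kind, err)
			continue
		}
		fmt.Printf("%-7s %s\n", kind, id)
	}
}
```

Run this inside and outside a container and diff the output: every ID that differs is a namespace boundary between the two processes.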
102.3 cgroups — the budget layer
Namespaces give a container its own view of the system. cgroups give it a budget. A cgroup (control group) is a kernel mechanism that limits, accounts for, and prioritizes the resource use of a set of processes. Without cgroups, a container could consume the host’s entire CPU, memory, disk I/O, and network bandwidth. With cgroups, the runtime declares “this group of processes gets at most 2 CPU cores and 4 GB of memory” and the kernel enforces it.
The resource controllers (called “subsystems” in v1, just “controllers” in v2) are:
- cpu: share of CPU time, via CFS bandwidth control (cpu.max in v2).
- memory: max resident memory, via the OOM killer or the page reclaimer.
- io: block-device I/O bandwidth, via io.max and io.weight.
- pids: max number of processes in the group (prevents fork bombs).
- cpuset: which specific CPUs and NUMA nodes the group can use.
- hugetlb: huge-page allocations.
- rdma: RDMA resources.
- devices: which device nodes the group can access.
A cgroup is a directory in a pseudo-filesystem. In v2 it is mounted at /sys/fs/cgroup. To create a group and limit its memory:
mkdir /sys/fs/cgroup/mycontainer
echo "2G" > /sys/fs/cgroup/mycontainer/memory.max
echo $$ > /sys/fs/cgroup/mycontainer/cgroup.procs # move current shell into it
The shell now cannot exceed 2 GB of memory (anonymous pages plus page cache); if it pushes past the limit, the kernel reclaims pages, and when reclaim fails the OOM killer terminates it. Container runtimes do exactly this: for each container, create a cgroup, write the limits, move the container’s init process into it.
The accounting side matters as much as the limit side. memory.current, memory.peak, cpu.stat, io.stat all report usage. This is how container metrics are collected — a cadvisor process reads these files for every cgroup and exposes them to Prometheus. When KEDA (Chapter 51) scales on “CPU usage,” it is ultimately reading a cgroup counter.
102.4 cgroups v1 vs v2 and why it matters
cgroups has two versions, and the difference is not purely academic. v1 was the original design from 2008. It gave each controller its own hierarchy — a process could be in “cpu cgroup A, memory cgroup B, io cgroup C” simultaneously. The flexibility led to inconsistency: different controllers evolved different semantics, and the interactions were hard to reason about. A process’s actual constraints depended on a Cartesian product of controllers.
v2 was merged in kernel 4.5 (2016) and has a single unified hierarchy — every process is in one cgroup, and that cgroup enables whichever controllers it needs. The controller semantics were rationalized. The interfaces were cleaned up. Memory controller got real support for kernel memory and slab accounting. The io controller got a proper cost-based model.
The migration has been slow. Many systems were still on v1 until ~2022 because distros defaulted to v1 for backward compatibility. As of 2024-2025 almost all modern Linux distros default to v2, and Kubernetes has shipped GA support for v2 since 1.25. Running v1 in 2026 is a red flag — it means old tooling that hasn’t been updated.
Why it matters for ML platforms. On v1, the memory controller does not correctly account for the GPU memory reservations or the kernel pages used by GPU drivers. Memory pressure inside a container does not trigger the expected behavior because the accounting is incomplete. On v2, the accounting is much more consistent. NVIDIA’s device plugin, vLLM, and containerd all have v2-specific codepaths. Running LLM serving on v1 in 2026 invites obscure memory-accounting bugs that are miserable to reproduce.
The practical implication: make sure your nodes run cgroups v2. stat -fc %T /sys/fs/cgroup/ returns cgroup2fs on v2 and tmpfs on v1. Every node in your K8s cluster should return the former.
102.5 The OCI image spec and the runtime spec
The Open Container Initiative (OCI) is the standards body that defines what a “container” means across tools. There are two relevant specs: the image spec and the runtime spec.
The image spec defines what is in a container image. An image is a set of tarballs (layers) plus a JSON manifest plus a JSON config. The manifest lists the layers by digest; the config specifies the entrypoint, environment variables, working directory, user, and exposed ports. Layers are stacked on top of each other in an overlay filesystem at runtime — the bottom layer might be debian-slim, the next layer might be /usr/local/lib/libpython, the top layer might be your application code. Each layer is content-addressed by its SHA256; identical layers across images are stored once. Chapter 106 goes deep on this.
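Content addressing is just hashing: a layer’s digest is the SHA-256 of its blob bytes, rendered as sha256:&lt;hex&gt;. A sketch of the computation (the digest format follows the OCI spec; the function name is mine):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// ociDigest computes an OCI-style content digest for a blob.
// Identical bytes always produce the identical digest, which is
// why identical layers are stored and pulled only once.
func ociDigest(blob []byte) string {
	sum := sha256.Sum256(blob)
	return fmt.Sprintf("sha256:%x", sum)
}

func main() {
	layer := []byte("pretend this is a gzipped layer tarball")
	fmt.Println(ociDigest(layer))
	// The digest of empty input is a well-known constant,
	// sha256:e3b0c442... — you will see it all over registries.
	fmt.Println(ociDigest(nil))
}
```

A manifest references its config and each layer by exactly this kind of digest, which is what makes images tamper-evident: change one byte of a layer and every digest up the chain changes.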
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:e3b0c44...",
    "size": 7023
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:a591fd4...",
      "size": 32654123
    },
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:b2d1e9a...",
      "size": 1245678
    }
  ]
}
The runtime spec defines what happens when you “run” an image. It specifies a directory structure called a “bundle” containing a config.json (the OCI runtime config) and a rootfs/ (the filesystem the container will see). The runtime config declares namespaces, cgroups, mounts, capabilities, seccomp profiles, and the process to run. Any OCI-compliant runtime (runc, crun, youki) can take this bundle and start a container from it.
The split matters because it standardizes the interface. Docker, containerd, podman, CRI-O, Kubernetes — all of them use OCI images and OCI runtimes. The image you build with docker build can run with podman run or crun or be unpacked by ctr and executed by runc directly. The standard is what makes the ecosystem cohere.
102.6 Distroless, scratch, Wolfi, Chainguard
The base image of a container is the bottom layer — what’s there before you add your application. Historical default: Ubuntu or Debian, ~200 MB, including a full userspace with shells, package managers, coreutils, libc, libssl, every dependency the distro maintainers bundle. For running a Go binary this is absurd overhead. None of those tools are needed at runtime.
Distroless is Google’s minimal base image family. gcr.io/distroless/static has nothing but /etc/ssl/certs, /etc/passwd, a few user entries, and a /etc/os-release. ~2 MB. No shell, no ls, no package manager. Your binary lands in this image and runs. gcr.io/distroless/base adds glibc for binaries that dynamically link. gcr.io/distroless/python3 adds a Python runtime. Each is the minimum possible surface for its use case.
Scratch is even more minimal — it is literally the empty image. FROM scratch means “my image has no parent layers.” A Go binary built with CGO off and static linking can run in scratch. But it cannot resolve hostnames without /etc/resolv.conf and nsswitch configuration, and it cannot verify TLS connections without a CA cert bundle, so in practice distroless/static is more useful — it ships those essentials.
Wolfi (from Chainguard) is a minimal, security-focused Linux distribution built specifically for containers. Packages are signed and built with full provenance. Images are kept at zero CVEs (or very close) by tracking upstream security advisories aggressively and rebuilding within hours of a fix. Chainguard Images are commercial distroless-style images built on Wolfi. They are the state of the art if you care about supply-chain security and do not want to run apt-get upgrade yourself.
The rule of thumb: start with distroless or a Chainguard image. Every MB you add to a base is attack surface you are responsible for. A bash binary in your image means an attacker who finds an RCE can pop a shell. A distroless or Chainguard image with no shell and no coreutils means an attacker who finds an RCE has no shell to pop.
102.7 Multi-stage Dockerfiles
The standard pattern for producing a distroless image is a multi-stage build. Stage one is a “builder” that contains the compiler toolchain and all the dependencies. Stage two is the runtime image, which copies only the compiled artifact from stage one. The runtime image never contains the compiler or build dependencies.
# Stage 1: builder
FROM golang:1.22 AS builder
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build \
-ldflags="-s -w -X main.version=$(git describe --tags)" \
-o /out/server ./cmd/server
# Stage 2: runtime
FROM gcr.io/distroless/static-debian12:nonroot
COPY --from=builder /out/server /server
USER nonroot:nonroot
EXPOSE 8080
ENTRYPOINT ["/server"]
The final image has the Go binary and nothing else. The builder stage (with Go, gcc, build tools, the full Debian base) is discarded by the Docker daemon after the build; only the final stage’s layers are kept.
A Python example is harder because Python needs the interpreter at runtime. The pattern:
FROM python:3.12-slim AS builder
WORKDIR /app
RUN pip install --no-cache-dir uv
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev
FROM gcr.io/distroless/python3-debian12:nonroot
WORKDIR /app
COPY --from=builder /app/.venv /app/.venv
COPY src/ /app/src/
ENV PATH="/app/.venv/bin:$PATH"
USER nonroot
ENTRYPOINT ["python", "-m", "src.server"]
The builder installs Python dependencies into a virtualenv using uv (Chapter 105). The runtime image is a distroless Python image with the virtualenv copied over and the source code. No pip, no build tools, no apt. One caveat to verify when copying a venv across images: the venv’s python symlink points at the builder’s interpreter path, and compiled wheels are tied to the Python minor version — keep the builder and runtime interpreters at the same version or the image fails at startup.
Multi-stage builds also work well with layer caching. The COPY go.mod go.sum ./ and RUN go mod download steps are cached as long as go.mod and go.sum haven’t changed, so adding a new source file only invalidates the layers from the final COPY . . onward. On a warm cache, a rebuild takes seconds.
102.8 The container runtime layer — containerd, CRI-O, runc
“Container runtime” is an overloaded term. There are actually two layers. The low-level runtime (runc, crun, youki) is the thing that actually makes kernel syscalls to set up namespaces and cgroups. The high-level runtime (containerd, CRI-O, dockerd) is the thing that manages image pulls, image storage, network setup, and calls the low-level runtime to start the container.
runc is the reference low-level runtime, written in Go, originally extracted from Docker. It reads the bundle’s OCI runtime config and calls clone() with the right flags. crun is a C rewrite, faster and smaller. youki is a Rust rewrite, newer. All three implement the same OCI runtime spec, and you can swap between them.
graph TD
kubelet[kubelet] -->|CRI gRPC| containerd[containerd<br/>high-level runtime]
containerd -->|OCI bundle| runc[runc / crun<br/>low-level runtime]
runc -->|clone syscall| kernel[Linux kernel]
kernel --> ns[namespaces + cgroups]
The container runtime stack has two layers: the high-level runtime (containerd) manages images and networking; the low-level runtime (runc) makes the actual kernel syscalls — knowing this stack lets you debug “ContainerCreating” from the right layer.
containerd is the high-level runtime used by Kubernetes. Its job is to manage the image store (pulling images from registries, unpacking layers, deduplicating content), set up the network namespace by invoking the CNI plugins, and orchestrate runc to start the container. Kubernetes talks to containerd through the Container Runtime Interface (CRI), a gRPC API. dockerd is the original Docker daemon; it does what containerd does plus a bunch of user-facing conveniences. As of Kubernetes 1.24, dockershim is gone and K8s talks to containerd directly.
CRI-O is an alternative high-level runtime, built by Red Hat, more minimal than containerd. It does less (no image building, no Docker compatibility) and targets only Kubernetes. For a K8s cluster you pick containerd or CRI-O; both work fine; containerd is more common.
The operational consequence: when a pod starts on a node, the sequence is kubelet → CRI → containerd → runc → kernel clone(). When something goes wrong at pod start, the question is which layer. crictl is the debugger for the CRI layer; ctr for containerd; runc has its own CLI. Knowing this stack is how you debug a container that is stuck in ContainerCreating.
102.9 Capabilities, seccomp, and the security surface
A container running as root has too many privileges. Linux “capabilities” are a fine-grained privilege model: instead of root vs non-root, there are about 40 distinct capabilities (CAP_NET_ADMIN, CAP_SYS_ADMIN, CAP_NET_BIND_SERVICE, CAP_CHOWN, etc.), and a process has a subset of them. Docker and Kubernetes drop most capabilities by default; a container process can bind() to a port above 1024 and read files owned by its user, but it cannot mount filesystems or manipulate network interfaces.
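Capability sets are visible as hex bitmasks in /proc/&lt;pid&gt;/status (the CapEff, CapBnd, etc. lines); each capability is one bit, indexed by its number from linux/capability.h — CAP_NET_BIND_SERVICE is bit 10. A sketch of decoding that mask (Linux-only; helper names are mine):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

const capNetBindService = 10 // bit index from linux/capability.h

// hasCap reports whether capability bit `bit` is set in a mask
// taken from a Cap* line of /proc/<pid>/status.
func hasCap(mask uint64, bit uint) bool {
	return mask&(1<<bit) != 0
}

// readCapMask extracts the named capability mask (e.g. "CapEff")
// from a /proc/<pid>/status-style file and parses its hex value.
func readCapMask(path, field string) (uint64, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, field+":") {
			hexMask := strings.TrimSpace(strings.TrimPrefix(line, field+":"))
			return strconv.ParseUint(hexMask, 16, 64)
		}
	}
	return 0, fmt.Errorf("%s not found in %s", field, path)
}

func main() {
	mask, err := readCapMask("/proc/self/status", "CapEff")
	if err != nil {
		panic(err)
	}
	fmt.Printf("CapEff=%016x  NET_BIND_SERVICE=%v\n",
		mask, hasCap(mask, capNetBindService))
}
```

Run it as root (mask full of ones) and as an unprivileged user inside a drop: ["ALL"] container (mask of zeros, or a single bit if you added NET_BIND_SERVICE) to see the capability model in action.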
The Kubernetes securityContext is how you declare this:
securityContext:
runAsNonRoot: true
runAsUser: 10000
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop: ["ALL"]
add: ["NET_BIND_SERVICE"] # only if you need to bind below 1024
seccompProfile:
type: RuntimeDefault
seccomp is a kernel facility that filters syscalls. A seccomp profile is a list of allowed syscalls; everything else gets EPERM or kills the process. The RuntimeDefault profile in containerd blocks about 50 dangerous syscalls (like mount, reboot, kexec_load) that a normal application never uses. Docker’s default seccomp profile is similar. Custom profiles can lock things down further for high-security workloads.
AppArmor and SELinux are the mandatory-access-control layers on top. They add path-based or label-based rules (“this container can only read files matching /app/*”). For most workloads, capabilities + seccomp + read-only root filesystem + non-root user is enough. For hardened workloads, add AppArmor or run on a distro with SELinux enforcing.
The failure mode to avoid: running everything as root with all capabilities because “it’s just internal.” An attacker who compromises one container then has full privileges to escape, inspect the host, and pivot. Every production container should run as non-root, with capabilities dropped, with a read-only root filesystem, with seccomp enabled. This is the default for modern base images and Kubernetes pod security standards.
102.10 The mental model
Eight points to take into Chapter 103:
- A container is a Linux process with namespace and cgroup flags set, not a lightweight VM. It shares the host kernel.
- Namespaces isolate the view. PID, mount, network, UTS, IPC, user, cgroup — seven axes.
- cgroups limit the budget. CPU, memory, io, pids — per-container, enforced by the kernel.
- Use cgroups v2. v1 has inconsistent semantics and incomplete accounting for GPU workloads.
- OCI splits image and runtime specs. Image: layers + manifest + config. Runtime: bundle + config.json + rootfs.
- Distroless, Wolfi, Chainguard. Minimal base images are the default for production — less surface, fewer CVEs.
- Multi-stage Dockerfiles separate build from runtime. The runtime image never contains the compiler.
- Security defaults: non-root, drop all caps, read-only root FS, seccomp RuntimeDefault. Always, not sometimes.
Chapter 103 changes gears — from the container that runs code to the structure inside the code itself, specifically how services wire their dependencies together at build and runtime.
Read it yourself
- Liz Rice, Container Security (O’Reilly, 2020). Deep dive on the kernel primitives and security surface.
- The OCI Image Specification and the OCI Runtime Specification, both at github.com/opencontainers.
- man 7 namespaces and man 7 cgroups on any modern Linux box — the canonical reference.
- Julia Evans, Linux containers in 500 lines of code (blog post). Writes a toy container runtime from scratch; nothing clarifies the model faster.
- The runc source code, specifically the libcontainer/nsenter package. A few hundred lines that set up all the namespaces.
- Chainguard’s blog on Wolfi and zero-CVE images, particularly the posts on supply-chain attestation.
Practice
- Run unshare --pid --fork --mount-proc bash as root. What do you see in ps aux? Why?
- Create a cgroup v2 group with a 100 MB memory limit, put a shell into it, and try dd if=/dev/zero of=/dev/null bs=200M count=1. What happens?
- Inspect the namespaces of a running Docker container with ls -l /proc/$(docker inspect -f '{{.State.Pid}}' <container>)/ns/. Compare them to the host’s.
- Write a Dockerfile that produces a <10 MB image for a Go “hello world.” Verify the size.
- Run a container with runAsNonRoot: true but as root in the image. What does Kubernetes do? What error do you see?
- Explain the difference between a namespace and a cgroup in one sentence each.
- Stretch: write a ~100-line Go program that starts a shell in new namespaces by setting CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWUTS in exec.Cmd’s SysProcAttr.Cloneflags. Make hostname in the child not affect the host.