Part IX · Build, Deploy, Operate
Chapter 113 · ~22 min read

CI as a system: path filters, per-service builds, coverage gates

"A CI pipeline is a distributed system that runs on every commit. Treat it like one or suffer like one."

Continuous integration is the background hum of software engineering. Developers push a commit, something happens on a remote machine, and a green check or red X appears. When it works, nobody thinks about it. When it doesn’t — when builds take forty minutes, when flaky tests block merges, when the wrong services rebuild on every commit, when coverage gates mysteriously stop enforcing — CI becomes the single largest drag on team velocity. A senior engineer who can design CI so it isn’t a drag is worth several who can build backend services.

The problem shape changes as the codebase grows. A 5-service repository needs almost nothing — go test ./... in a single job is fine. A 50-service monorepo needs a sophisticated pipeline that knows which services changed, fans out per-service jobs in parallel, shares build caches, enforces coverage gates, spins up ephemeral preview environments, and reports everything back to the PR in a way humans can scan in seconds. The gap between those two regimes is several quarters of engineering work, and the patterns that make the second tractable are the subject of this chapter.

This is also the Part IX capstone. Chapters 101-112 have covered everything from the build system upward: containers, DI, API contracts, Python tooling, the OCI lifecycle, GitOps, Helm/Kustomize/CDK8s, multi-cell architecture, IaC, secrets, edge ingress. This chapter ties it together with the thing that actually runs every time code changes — and §113.10 recaps the whole arc.

Outline:

  1. CI as a distributed system.
  2. Path filters and the monorepo problem.
  3. Per-service fan-out and parallelism.
  4. Build caches — the biggest lever.
  5. Coverage gates and enforcement.
  6. The blocking vs advisory line.
  7. Ephemeral preview environments.
  8. The CI platform landscape.
  9. Flakiness and the retry question.
  10. Recap of Part IX.
  11. The mental model.

113.1 CI as a distributed system

Think about what CI actually is. On every push, an orchestrator receives a webhook, decides what work to do, schedules that work across a pool of runner machines, distributes the source code, fetches dependencies, runs the tests and builds, collects the results, reports them back to the source control system, and updates the commit status. All of this is distributed computation — workers, scheduling, state, caching, retry semantics, idempotency.

The failure modes are distributed-system failure modes: runners die, caches get corrupted, network fetches are flaky, the orchestrator drops jobs, the source control system’s webhook delivery is late, the status API has a 429. A naive CI pipeline runs as a single script and hits all of these with no recovery. A sophisticated pipeline recognizes them and handles each one.

The mental frame that helps: CI is a deterministic function of the source code at a commit. Given the same commit, the pipeline should produce the same outputs — the same build artifacts, the same test results, the same pass/fail. Non-determinism is a bug. Every flaky test, every “retry and it passes” incident, every “works on my laptop” failure is a violation of determinism, and each one costs real engineering time.

Determinism is expensive to achieve. It requires:

  • Pinned dependencies (lockfiles, digest-pinned container images — see Chapter 106).
  • Hermetic builds (no network during the build — see Chapter 101 on Bazel/Pants/Buck).
  • Controlled execution environment (specific runner image, specific tool versions).
  • Stable test ordering (or explicit parallelism that’s order-independent).
  • No clock dependencies (tests don’t depend on the current time).
  • No external service dependencies (tests don’t call real APIs).

Not every team needs full determinism, but every team benefits from moving toward it. The closer to deterministic, the faster debugging gets, and the more trust the team has in the CI signal.
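The last three requirements can be enforced in the tests themselves. A minimal Python sketch of the idea, assuming pytest-style test functions; `FixedClock` is an illustrative helper, not a library API:

```python
import random
from datetime import datetime, timezone

# Deterministic randomness: construct an explicitly seeded RNG instead of
# using the module-level random functions, so every run draws the same values.
def make_rng(seed: int = 42) -> random.Random:
    return random.Random(seed)

# Deterministic time: inject a clock object instead of calling
# datetime.now() inside the code under test.
class FixedClock:
    def __init__(self, frozen: datetime):
        self._frozen = frozen

    def now(self) -> datetime:
        return self._frozen

def test_token_generation_is_deterministic():
    clock = FixedClock(datetime(2026, 1, 1, tzinfo=timezone.utc))
    rng = make_rng()
    token = f"tok-{rng.randint(0, 999999):06d}"
    # Same seed and same clock mean the same token on every CI run.
    assert token == f"tok-{make_rng().randint(0, 999999):06d}"
    assert clock.now().year == 2026
```

The same injection pattern replaces real API calls: pass a fake client in tests, the real one in production wiring.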

113.2 Path filters and the monorepo problem

A 50-service monorepo has a problem: every push touches the repo, but most pushes only touch one service. Running the full build — all 50 services, all tests, all linters, all static analyses — on every commit is wasteful. Worst case, it takes an hour and the pipeline becomes unusable. Best case, it takes ten minutes and everyone’s idle time adds up.

The fix is path filters: conditionally skip build steps that don’t depend on the changed paths. The canonical GitHub Actions tool for this is dorny/paths-filter:

graph LR
  Push[git push] --> Filter[paths-filter job]
  Filter -->|services/api/** changed| JobA[build-api job]
  Filter -->|services/worker/** changed| JobB[build-worker job]
  Filter -->|apps/frontend/** changed| JobC[build-frontend job]
  Filter -->|nothing matched| Skip[skipped ✓]
  JobA & JobB & JobC --> Report[PR status checks]

Path filters fan out from a single change-detection job to per-service jobs that run only when their paths are touched — a push that only changes the API skips the worker and frontend builds entirely.

jobs:
  changes:
    runs-on: ubuntu-latest
    outputs:
      api: ${{ steps.filter.outputs.api }}
      worker: ${{ steps.filter.outputs.worker }}
      frontend: ${{ steps.filter.outputs.frontend }}
      shared: ${{ steps.filter.outputs.shared }}
    steps:
      - uses: actions/checkout@v4
      - uses: dorny/paths-filter@v3
        id: filter
        with:
          filters: |
            api:
              - 'services/api/**'
              - 'libs/shared/**'
            worker:
              - 'services/worker/**'
              - 'libs/shared/**'
            frontend:
              - 'apps/frontend/**'
            shared:
              - 'libs/shared/**'

  build-api:
    needs: changes
    if: ${{ needs.changes.outputs.api == 'true' }}
    runs-on: ubuntu-latest
    steps: [...]

The changes job runs on every push and emits outputs saying which “modules” had changes. Subsequent jobs consume those outputs to decide whether to run. A push that only touches services/api/ runs only build-api, skipping the rest.

The gotcha in path filters: shared code triggers everything. If libs/shared/** changes, every service that depends on libs/shared should be rebuilt. You encode this in the filter (each service’s filter includes the shared paths), but the moment you forget to add a dependency, you get silent under-testing. The Bazel/Pants/Buck crowd solves this properly: the build system itself computes the dependency graph, so a change to a shared library transitively invalidates exactly the targets that depend on it. Path-filter-based CI is an approximation of the build-graph approach.
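The approximation can be made explicit as a dependency map in code — the service names and path prefixes below are illustrative. Given the changed paths from a push, compute which services to rebuild, including those pulled in through shared libraries:

```python
# Explicit approximation of the build graph: each service lists the path
# prefixes it depends on. A forgotten entry here is exactly the
# silent-under-testing bug described above.
SERVICE_DEPS = {
    "api":      ["services/api/", "libs/shared/"],
    "worker":   ["services/worker/", "libs/shared/"],
    "frontend": ["apps/frontend/"],
}

def services_to_rebuild(changed_paths: list[str]) -> set[str]:
    """Return every service whose declared path prefixes match a changed file."""
    return {svc for svc, prefixes in SERVICE_DEPS.items()
            if any(p.startswith(prefix) for p in changed_paths for prefix in prefixes)}

print(sorted(services_to_rebuild(["libs/shared/retry.go"])))   # ['api', 'worker']
print(sorted(services_to_rebuild(["apps/frontend/app.tsx"])))  # ['frontend']
```

A real build system computes this map from the import graph; maintaining it by hand is the tax path-filter CI pays.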

Path filters also interact awkwardly with branch-based triggers. On main, you usually want to build everything (to catch drift or cross-service breakage). On PRs, you want only the affected services. The pipeline should differentiate, typically with an if: on the branch.

One more subtlety: required status checks. GitHub lets you mark specific checks as “required” before merging. If a required check never runs — for example, because the whole workflow was skipped by an on.<event>.paths filter — the check stays pending and the PR can’t merge, even though the check is irrelevant to the change. The fix is a placeholder workflow that reports success for each required check that didn’t actually run. Awkward, but necessary.
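One way to implement the placeholder, as a hedged sketch (the workflow and job names here are illustrative, and the job name must mirror the real required check exactly): a second workflow triggered on the inverse path filter that reports success under the same name.

```yaml
# Placeholder: runs only when the API paths did NOT change, and reports
# success under the same check name the branch-protection rule requires.
name: api
on:
  pull_request:
    paths-ignore:
      - 'services/api/**'
      - 'libs/shared/**'
jobs:
  build-api:   # must match the job name of the real check
    runs-on: ubuntu-latest
    steps:
      - run: echo "No API changes; satisfying the required check."
```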

113.3 Per-service fan-out and parallelism

Once you know which services changed, you fan out. Each changed service becomes its own job, running in parallel. This is the biggest speedup lever available — rather than running N service builds sequentially in 30 minutes, run them in parallel in 3 minutes.

A GitHub Actions matrix pattern:

  build-services:
    needs: changes
    if: ${{ needs.changes.outputs.any_service == 'true' }}
    runs-on: ubuntu-latest
    strategy:
      matrix:
        service: ${{ fromJSON(needs.changes.outputs.changed_services) }}
      fail-fast: false
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
      - run: make build test SERVICE=${{ matrix.service }}

The changes job emits a JSON array of changed services (["api", "worker"]), the matrix expands to one job per service, and they run in parallel on separate runners. fail-fast: false is important — by default, the first failure cancels the others, but you usually want to see all failures on the first run so you can fix them all at once.

The limits of parallelism. Every runner costs money (and runner capacity). GitHub-hosted runners are on a shared pool with concurrency limits per account — free tier has a few concurrent jobs, paid tiers have more. Self-hosted runners scale as large as you want but require operations. The ceiling is typically 10-50 parallel jobs before the runner pool becomes the bottleneck rather than the code.

Beyond per-service fan-out, per-test-suite fan-out is the next level. Split a long test suite into N shards, run each shard on a separate runner. Some test frameworks support this natively (pytest-split, Jest’s --shard=1/4); Go has no built-in sharding, so Go teams typically split the package list across runners themselves. A 20-minute test suite split into four shards takes 5 minutes of wall time. The operational cost is that results need to be aggregated at the end, which some CI platforms make easier than others.
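When the framework doesn’t shard natively, a deterministic partition is easy to sketch: hash each test file’s name modulo the shard count, so every runner independently computes the same assignment with no coordination. A minimal Python version (function names are illustrative):

```python
import hashlib

def shard_of(test_file: str, total_shards: int) -> int:
    """Stable shard index for a test file: same input, same shard, on every runner."""
    digest = hashlib.sha256(test_file.encode()).hexdigest()
    return int(digest, 16) % total_shards

def select_shard(test_files: list[str], shard: int, total_shards: int) -> list[str]:
    """The subset of files this runner should execute."""
    return [f for f in test_files if shard_of(f, total_shards) == shard]

files = ["test_auth.py", "test_billing.py", "test_search.py", "test_users.py"]
# The union of all shards covers every file exactly once.
all_selected = [f for s in range(4) for f in select_shard(files, s, 4)]
assert sorted(all_selected) == sorted(files)
```

Hash-based splitting ignores test duration; duration-weighted splitting (what pytest-split does) balances wall time better at the cost of tracking timing data.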

113.4 Build caches — the biggest lever

The single biggest determinant of CI speed is whether you’re rebuilding from scratch or reusing a cache. A clean build of a 50-service Go monorepo might take 40 minutes. A fully-cached rebuild of the same thing might take 90 seconds. The difference is whether the CI cache is working.

Types of caches:

Four cache layers, stacked by savings potential from bottom to top, each adding hit rate for a different change pattern:

  • Docker layer cache. Key: content hash of the layer’s inputs. Hit when dependencies are unchanged; saves re-downloading base images and re-installing dependencies.
  • Language-native dependency cache (Go modules, uv, npm). Key: lockfile hash. Hit when dependencies are unchanged.
  • Compiler output cache (GOCACHE, sccache, Bazel remote cache). Key: source hash. Hit when source is unchanged; saves re-compiling unchanged code.
  • Test result cache (Bazel, Nx, Jest). Key: test file + dependency hashes. Skips re-running unchanged tests entirely, where available.

Language-native caches. The Go module cache, the npm / pnpm / yarn cache, the pip / uv cache, the Cargo registry cache, the Gradle cache. These cache downloaded dependencies — source or binaries fetched from remote registries. Missing this cache means re-downloading everything on every job.

Compiler output caches. ccache for C/C++, sccache for Rust, Go’s build cache (GOCACHE), Bazel’s remote cache, Nx’s build cache. These cache compiled artifacts, keyed by inputs (source files, compiler flags, dependencies). On a cache hit, the compiler skips the work entirely. This is where the biggest speedups come from.

Docker layer cache. When building container images, each layer is content-addressed. If the layer’s inputs haven’t changed, the cached layer is reused. The pattern: structure Dockerfiles so frequently-changing layers (source code) come after rarely-changing layers (base image, dependencies), so cache hits cover most of the build.

Test result caches. If a test’s inputs haven’t changed, the test’s result from a previous run is still valid. Bazel and Nx do this natively. Jest has experimental support. Most traditional CI pipelines don’t have this and rerun all tests every time.

The canonical cache layout on GitHub Actions:

      - uses: actions/cache@v4
        with:
          path: |
            ~/.cache/go-build
            ~/go/pkg/mod
          key: ${{ runner.os }}-go-${{ hashFiles('**/go.sum') }}
          restore-keys: ${{ runner.os }}-go-

The key identifies the exact cache — in this case, keyed by the hash of go.sum. If go.sum hasn’t changed, the cache hits exactly. The restore-keys fallback lets you use an older cache as a starting point even if the exact key doesn’t match, which covers the common case of “dependencies changed a little, but we can still reuse most of the cache.”

The common mistake: caches that are too coarse. A cache keyed by the repo root means any code change invalidates the cache for everything. The fix is per-service caches (keyed by the service’s lockfile) or finer-grained caches (language-native caches, which are usually keyed by individual module hashes).

The other common mistake: caches that are too fine. A cache keyed by the exact commit means every commit misses the cache. The fix is to key by inputs, not by commit — typically the lockfile hash, not the full source hash.

For large monorepos, traditional file-based caching isn’t enough. Bazel with a remote cache (backed by Redis, a GCS bucket, or an S3 bucket) lets multiple CI runs share results at the action level. A developer’s local Bazel build pulls cached results from the same bucket, so a full CI run is often the first one to compile anything. This is the endgame of build caching, and it is what distinguishes a Google- or Meta-style build experience from a typical CI setup.

113.5 Coverage gates and enforcement

Code coverage is a controversial metric. Too low and nobody trusts the tests; too high and engineers write tests just to satisfy the tool, producing useless tests that exercise code without asserting behavior. But coverage gates, used well, are a meaningful signal that a PR added code and added tests for that code.

The standard pattern: patch coverage, not total coverage. Patch coverage measures the percentage of new or changed lines in a PR that are covered by tests. Total coverage measures the percentage of the entire codebase. Patch coverage is the useful one — it answers “did this PR test the code it added?” without penalizing PRs that touch untested legacy code.

Implementation. Run tests with coverage instrumentation (go test -cover, pytest --cov, Jest’s --coverage, Rust’s grcov), collect the coverage data, compute the patch coverage using a tool like diff-cover or a platform like Codecov / Coveralls, and fail the build if patch coverage is below a threshold (typically 70-90%).

      - run: go test -coverprofile=coverage.out ./...
      - uses: codecov/codecov-action@v4
        with:
          files: coverage.out
      - name: Enforce patch coverage
        run: |
          # diff-cover reads Cobertura-style XML, not Go's native profile
          gocover-cobertura < coverage.out > coverage.xml
          diff-cover coverage.xml --compare-branch=origin/main --fail-under=80

The threshold is a judgment call. 80% is a reasonable starting point for most codebases. 100% is unreasonable — every codebase has code that’s hard or pointless to test (generated code, thin wrappers, error handling that would require heavy mocking). Teams that set 100% as a requirement end up writing tests that go through the motions without providing real coverage of behavior.

Gate tuning. Start the gate low (say, 60%) and ratchet it up as the codebase improves. Explicit opt-outs for specific files (generated code, legacy) are fine; tracking them in a config file that requires a code review to modify keeps the list honest.
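The ratchet itself can be a few lines of script. A hedged Python sketch, assuming a baseline file checked into the repo (the filename is illustrative): fail if coverage drops below the stored baseline, and raise the baseline whenever coverage improves.

```python
import json
from pathlib import Path

BASELINE = Path("coverage-baseline.json")  # illustrative path, committed to the repo

def check_and_ratchet(current_pct: float, tolerance: float = 0.0) -> bool:
    """Return True if the gate passes; ratchet the stored baseline upward on improvement."""
    if BASELINE.exists():
        baseline = json.loads(BASELINE.read_text())["pct"]
    else:
        baseline = 60.0  # the low starting gate
    if current_pct + tolerance < baseline:
        print(f"FAIL: coverage {current_pct:.1f}% is below baseline {baseline:.1f}%")
        return False
    if current_pct > baseline:
        # Ratchet: the new, higher coverage becomes the floor for future PRs.
        BASELINE.write_text(json.dumps({"pct": current_pct}))
    return True
```

The baseline file change lands in the same PR that improved coverage, so the ratchet is visible in review.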

The biggest pitfall: gates that run but don’t enforce. A status check that’s “posted” but not “required” is advisory, not blocking. PRs merge with red checks because the check isn’t marked required in the branch protection rules. Every few months, someone audits the required checks list and discovers that coverage drifted because the gate stopped actually blocking anything. Audit your required checks quarterly.

113.6 The blocking vs advisory line

Every CI check sits on one side of a line: blocking (the PR cannot merge if this check fails) or advisory (the check runs, the result is visible, but it doesn’t block merge).

What should be blocking:

  • Unit tests that are fast and reliable.
  • Compilation / type checks.
  • Linters that catch real bugs (not just style).
  • Security scanners (SAST, dependency vulnerability) at high severity.
  • Coverage gates at conservative thresholds.
  • Required reviewers.

What should be advisory (at least initially):

  • Performance regression tests (unless you’ve tuned them to be reliable).
  • Integration tests against external services (flaky).
  • Experimental static analyzers.
  • Style linters that are debatable.
  • Coverage gates above 80%.

The rule: blocking checks must be deterministic and actionable. If a check is flaky (passes sometimes, fails sometimes, on the same code), making it blocking means people will retry until it passes, which is the worst possible signal. If a check fails with a message like “security issue detected” without a clear action, making it blocking means people will disable it. Advisory is better than a gate that’s routinely ignored.

The arc of a new check: add as advisory, observe for two weeks to see how often it fires and whether the signal is real, tune until it’s reliable, then promote to blocking. Skipping the advisory phase and adding a new check directly as blocking is how teams end up in “CI is broken, everyone retry” purgatory.

A related principle: failures must be actionable within the CI output. A red check with “tests failed” and no link to the actual failure is useless. The output should include the test name, the assertion, the stack trace, and ideally a link to the line of code that failed. Good CI platforms make this easy; custom shell scripts usually don’t. Invest in the output formatting — it pays back every time someone has to debug.

113.7 Ephemeral preview environments

The high-leverage CI feature that most teams never quite get around to: every PR gets its own running environment.

The idea: when a PR opens, CI deploys the code to a temporary environment — its own namespace in a staging cluster, its own database (usually a snapshot of a shared one), its own hostname (something like pr-1234.preview.example.com), and anyone with access can click the link on the PR and interact with the running code. When the PR merges or closes, the environment is torn down.

Why it matters:

  • Reviewers can actually try the change, not just read the code. Catches bugs that only show up at runtime.
  • Cross-service changes can be validated end-to-end without waiting for merge-to-staging.
  • Product and design can see the change before merge, which dramatically tightens the feedback loop.
  • Frontend teams in particular benefit — a screenshot in a PR is nothing like clicking around the real UI.

The implementation. On PR open or push:

  1. Build the container image as normal.
  2. Generate a unique identifier for the PR (PR number or commit SHA).
  3. Use Helm / Kustomize / CDK8s to template a deployment with that identifier baked in.
  4. Apply the manifests to a preview namespace.
  5. Configure DNS / ingress to expose it at a PR-specific hostname.
  6. Seed a database (or point at a shared staging DB).
  7. Comment the preview URL on the PR.

On PR close:

  1. Delete the namespace.
  2. Clean up any PR-specific DNS records.
  3. Release any per-PR database snapshot.
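The teardown side can hang off the PR close event. A hedged GitHub Actions sketch — the pr-<number> namespace convention is an assumption, and DNS/snapshot cleanup depends on your tooling:

```yaml
name: preview-teardown
on:
  pull_request:
    types: [closed]   # fires on both merge and close
jobs:
  teardown:
    runs-on: ubuntu-latest
    steps:
      - run: kubectl delete namespace "pr-${{ github.event.pull_request.number }}" --ignore-not-found
      # DNS records and per-PR database snapshots would be released here too;
      # the exact steps depend on your DNS provider and snapshot tooling.
```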

The gotchas:

  • Cost. Preview environments multiply cluster resource usage. If every open PR has a full stack running, costs can balloon. Mitigations: TTLs (delete after 48h of inactivity), smaller resource requests for preview pods, not every service needs a preview per PR.
  • Database seeding. Real data is sensitive; anonymized data loses fidelity; shared staging data means PRs affect each other. No perfect answer.
  • Secrets. Preview environments need credentials, which means either per-preview secrets (operational overhead) or shared secrets (blast radius).
  • DNS and TLS. Wildcard certs and DNS automation are required to avoid per-PR manual setup.

Tools that help: Argo CD’s ApplicationSet pull-request generator can stamp out a per-PR application, Vercel / Netlify do this out of the box for frontends, Okteto and Garden are dedicated preview-environment platforms, and several teams have built their own on top of Helm + cert-manager + wildcard DNS.

If preview environments feel like too much infrastructure to build, start with frontend-only previews. Every frontend PR gets a deployed static site; the backend is shared staging. This covers 80% of the benefit at 20% of the cost.

113.8 The CI platform landscape

The landscape as of 2026:

GitHub Actions. The dominant platform for most teams using GitHub. Strengths: tight integration with GitHub, huge marketplace of actions, YAML-based config, hosted runners for small scale, self-hosted runners for large scale, OIDC-based cloud auth. Weaknesses: workflow files can get sprawling, reusable workflows help but are awkward, some limits (concurrent jobs, runtime) on hosted runners, debugging is limited to log output. For 90% of teams on GitHub, Actions is the right default.

GitLab CI. GitLab’s native CI. Strengths: tight integration with GitLab, container registry built in, review apps (GitLab’s term for preview environments) are a first-class feature, Auto DevOps for turnkey pipelines. Weaknesses: slightly fewer third-party integrations than GitHub, the runner model has more operational overhead. Dominant in teams using self-hosted GitLab.

Buildkite. The “we run the control plane, you run the runners” model. Strengths: excellent scale (teams running tens of thousands of jobs/day), first-class parallelism, good CLI, pipeline-as-code. Weaknesses: more DIY than GitHub Actions, smaller marketplace, requires you to manage the runners. Popular with larger teams that outgrow GitHub Actions at scale.

CircleCI. Strengths: fast, stable, good caching, good container support. Weaknesses: pricing at scale, has lost ground to GitHub Actions over the last few years. Still a reasonable choice for teams that want a third-party CI without DIY.

Jenkins. The old-school option. Strengths: infinitely customizable, huge plugin ecosystem, can do anything. Weaknesses: shows its age, operational overhead is significant, security is a constant concern (outdated plugins, master-agent model). Still prevalent in enterprise shops but rarely the choice for new greenfield.

Tekton. Kubernetes-native CI primitives. Strengths: runs as CRDs in your existing cluster, tight Kubernetes integration. Weaknesses: lower-level than the others, requires more assembly, small user base compared to Actions / GitLab. Good for teams building internal CI platforms on top of Kubernetes.

Dagger. A newer entrant that takes a “CI is code” approach — you write pipelines in TypeScript/Python/Go using the Dagger SDK, which produces portable pipelines that run anywhere. Interesting if you’re frustrated with YAML-based CI config. Small community so far.

For a typical team, GitHub Actions is the default. The tipping points to other tools: GitLab if self-hosting the whole DevOps stack, Buildkite at the 10K-jobs-a-day scale, Tekton for platform teams building their own CI layer, Jenkins only if inherited.

113.9 Flakiness and the retry question

Flaky tests are tests that sometimes pass and sometimes fail on the same commit, with no real change in the code or the system. They are the single biggest threat to a healthy CI pipeline. A 1% flake rate on a single test, across a 100-test suite, means a 63% chance that any given run has at least one flake. At that rate, the team stops trusting CI and starts hitting “retry” reflexively.
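The arithmetic behind that 63% figure: if tests flake independently, the probability that a run is fully clean is (1 − p)^n, so the chance of at least one flake is its complement.

```python
def p_at_least_one_flake(per_test_flake_rate: float, num_tests: int) -> float:
    """Probability that at least one test flakes, assuming independent tests."""
    return 1 - (1 - per_test_flake_rate) ** num_tests

# 1% flake rate across 100 tests: roughly a 63% chance any run is red for no reason.
print(round(p_at_least_one_flake(0.01, 100), 3))  # 0.634
```

The independence assumption is generous; correlated flakes (shared fixtures, resource exhaustion) make real suites worse.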

Sources of flakiness:

  • Timing-dependent tests. Tests that assume a timeout is enough or that operations complete in a specific order.
  • External dependencies. Tests that call real APIs, real databases, real clocks.
  • Parallelism. Tests that share state and fail when run in a different order.
  • Nondeterministic setup. Tests that depend on the current time, random data, or environment variables.
  • Slow operations. Tests that sometimes exceed their own timeouts under CI load.
  • Resource exhaustion. Runners running out of memory or CPU intermittently.

The wrong answer is automatic retries. It’s tempting — if a test is flaky, retry it, and if the retry passes, move on. CI frameworks often support this natively. Don’t do it. Automatic retries mask the underlying flake, and the flake rate grows unbounded because there’s no pressure to fix it. A year later, your “reliable” test suite has 20% flake rate hidden behind retries, and debugging any real failure requires running the whole suite five times.

The right answer is flake tracking and enforcement. When a test flakes, log it, count the flake rate over time, and block merging new tests that have high flake rates. Quarantine the worst offenders: move them out of the main suite and into a separate “known flakes” suite that doesn’t block merges, with an SLO on getting them fixed. Fix them or delete them on a schedule.
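A minimal sketch of the detection logic, under the assumption that results are recorded per (test, commit) pair — a test that both passed and failed on the same commit is flaky by definition. Class and method names here are illustrative:

```python
from collections import defaultdict

class FlakeTracker:
    def __init__(self):
        # (test name, commit SHA) -> set of outcomes observed for that pair
        self._results = defaultdict(set)

    def record(self, test: str, commit: str, passed: bool) -> None:
        self._results[(test, commit)].add(passed)

    def flaky_tests(self) -> set[str]:
        # Flaky: both a pass and a fail were observed on the same commit.
        return {test for (test, _), outcomes in self._results.items()
                if outcomes == {True, False}}

tracker = FlakeTracker()
tracker.record("test_checkout", "abc123", True)
tracker.record("test_checkout", "abc123", False)  # same commit, different outcome
tracker.record("test_login", "abc123", False)     # consistent failure, not a flake
print(tracker.flaky_tests())  # {'test_checkout'}
```

A production version would persist to a database and compute flake rates over a time window, but the pass-and-fail-on-same-commit signal is the core of every flake tracker.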

Some CI platforms and test frameworks have built-in flake detection (Buildkite has it, some pytest plugins do it, Datadog CI Visibility tracks it across runs). For teams without tooling, even a simple “track test failures by name in a database and alert when the same test fails intermittently” script is useful.

The principle: the CI signal is the product. If the signal is noisy, nobody trusts it, and the whole investment in CI is wasted. Protecting signal integrity is the job of whoever owns CI, and it requires active maintenance, not just setup.

113.10 Recap of Part IX

This chapter closes Part IX — Build, Deploy, Operate. Step back and look at the arc.

Part IX started with the build system itself (Chapter 101 on Bazel, Pants, Buck, Nx), because monorepos eventually demand more than a language’s native tooling. From there, the stack built upward: containers (Chapter 102) as the unit of deployment, dependency injection patterns (Chapter 103) for cleanly assembling components, API contract design (Chapter 104) for how those components talk to each other, Python tooling (Chapter 105) for the modern stack around uv and ruff, the OCI image lifecycle (Chapter 106) for how images move from build to registry to production, and GitOps philosophy (Chapter 107) for how deployments are driven from git.

The second half of Part IX zoomed out to the operational surround. Helm versus Kustomize versus CDK8s (Chapter 108) for how Kubernetes manifests are authored and composed. Multi-cluster and multi-cell architecture (Chapter 109) for how failure is contained as the system grows. IaC — Terraform, Pulumi, CDK (Chapter 110) — for how the cloud infrastructure underneath all of it is defined. Secrets management (Chapter 111) for the credential story that every layer depends on. Edge ingress (Chapter 112) for how traffic reaches the services. And this chapter, CI as a system, for the pipeline that builds and tests every change before it ships.

Thirteen chapters, one system — the full stack from “the developer commits code” to “the request reaches the pod in production.” If a candidate walks into a senior ML systems interview and can speak fluently about this whole arc — build → container → API contract → image → deploy → manifest → cluster topology → IaC → secrets → ingress → CI — that candidate is the strongest person in the room on the production axis.

Part IX is done. Part X (starting with Chapter 114) is the interview playbook: how to take every previous chapter and assemble it into a design-interview answer under pressure.

113.11 The mental model

Eight points to take into Chapter 114:

  1. CI is a distributed system. Treat it with the same rigor — determinism, retry semantics, observability, caching.
  2. Path filters are essential for monorepos. The limit is that shared-code changes must transitively trigger dependents; build systems like Bazel do this properly.
  3. Per-service fan-out via matrix jobs is the biggest wall-clock speedup. Runner capacity is the ceiling.
  4. Build caches are the single biggest lever. Language-native + compiler output + Docker layer + test result caches, keyed by inputs not commits.
  5. Patch coverage beats total coverage. Start the gate low, ratchet up. Audit required checks quarterly.
  6. Blocking checks must be deterministic and actionable. Flaky or vague checks drive people to ignore CI.
  7. Ephemeral preview environments are the high-leverage feature most teams skip. Even frontend-only previews pay off.
  8. Flakiness is the enemy. Track, quarantine, fix. Automatic retries are the wrong answer.

In Chapter 114, Part X opens: the ML system design interview playbook. The scoping template, the warm-up round, and how the rest of this book collapses into a 45-minute answer.


Read it yourself

  • The GitHub Actions documentation, especially reusable workflows, matrix strategies, and caching.
  • dorny/paths-filter repository and README.
  • The Bazel remote cache documentation and the Bazel Build Event Protocol.
  • The Codecov documentation on patch coverage.
  • Continuous Delivery (Humble and Farley, Addison-Wesley, 2010) — dated in tools, still correct in principles.
  • Google’s Software Engineering at Google (Winters, Manshreck, Wright, O’Reilly, 2020), chapters on build systems and CI.
  • The Okteto and Garden documentation for preview environments.

Practice

  1. Write a GitHub Actions workflow that uses dorny/paths-filter to detect changes in three services and run per-service jobs only for the changed ones. Include a branch-based override so main builds everything.
  2. Design the cache keys for a Go monorepo with 20 services. Which caches are per-service, which are shared, and what are the keys?
  3. Explain why patch coverage is a better gate than total coverage. Construct a scenario where total coverage stays flat but patch coverage catches a regression.
  4. A test has a 2% flake rate. After one merge a day for 10 days, what’s the expected number of false failures? Argue for or against automatic retries.
  5. Sketch a preview-environment pipeline for a full-stack app (frontend + API + Postgres). What gets torn down on PR close? How is the database seeded?
  6. Compare GitHub Actions, Buildkite, and Tekton on four axes: ease of setup, scale ceiling, flexibility, operational burden.
  7. Stretch: Take a real repository (open source or your own) and add a per-service fan-out pipeline with cache, patch coverage gate, and a preview environment for one frontend. Measure the wall-clock time before and after.