GitOps philosophy: ArgoCD, Flux, the App-of-Apps pattern
"If it's not in git, it's not in production. If it's in git, it's converging to production"
GitOps is the discipline that Chapter 106 was building toward. The deploy repo is the source of truth; a controller in the cluster watches the repo and makes the cluster match; drift is automatically detected and corrected. This flips the deployment model from “push changes to clusters” to “clusters pull changes from git,” and the consequences are large: audit trails, rollback, multi-cluster fanout, self-healing, and the elimination of kubectl apply from production-access paths.
By the end of this chapter the reader can explain why pull-based deployment beats push-based, configure ArgoCD for a real service, understand the App-of-Apps pattern and when ApplicationSets are the right answer, reason about sync waves and drift detection, and pick between ArgoCD and Flux with clear criteria. The chapter closes the first half of Part IX; Chapters 108-111 pick up with templating, multi-cluster, IaC, and secrets.
Outline:
- Pull vs push deployment.
- Git as the source of truth.
- Drift detection and self-healing.
- ArgoCD architecture.
- Sync waves and hooks.
- The App-of-Apps pattern.
- ApplicationSets and multi-cluster fanout.
- Flux and how it differs.
- Progressive delivery with Argo Rollouts / Flagger.
- GitOps anti-patterns.
- The mental model.
107.1 Pull vs push deployment
The pre-GitOps world was push-based. A CI pipeline, running in some CI system, held credentials to the target Kubernetes cluster. After building an image, the pipeline would kubectl apply -f deployment.yaml or helm upgrade against the cluster. The CI system had cluster-admin (or close to it); every cluster change was initiated from outside.
The problems with push-based deployment compound as the cluster count grows. The CI system needs credentials to every cluster. Those credentials are long-lived, widely held, and a giant blast-radius risk — a compromised CI pipeline has write access to prod. Multi-cluster fanout means the CI pipeline has to loop over clusters, handle partial failures, coordinate rollouts, and track per-cluster state — none of which CI systems are built for. Drift (someone runs kubectl edit manually and the cluster no longer matches git) is invisible; the next CI run notices but has no record of what was changed.
Pull-based deployment inverts this. A controller lives inside each cluster and watches a git repository. When the controller sees a change, it applies it locally. The cluster pulls its desired state; nothing outside the cluster has write access. The CI pipeline’s job ends at “push a commit to the deploy repo.” Everything after that is the controller’s problem.
The consequences:
- No external write access. CI systems do not hold cluster credentials. The attack surface shrinks dramatically.
- Multi-cluster is trivial. Every cluster runs its own controller pointed at the repo. Deploying to 20 clusters is the same as deploying to 1; each controller picks up the change independently.
- Drift is detected. The controller continuously compares the cluster state to git. If someone runs kubectl edit, the controller notices and either alerts or reverts.
- Rollback is git revert. No special tooling; the audit trail is the git log.
- The cluster is auditable from git. git log deploy-repo/prod/my-service/ tells you every change that's touched this service.
This is GitOps in one sentence: the cluster state is a function of the git state, computed by a controller running inside the cluster.
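The rollback consequence is worth making concrete. A sketch, assuming a locally checked-out deploy repo and a known bad commit (the path matches the example above; the SHA is a placeholder):

```shell
# Find the commit that introduced the bad change to this service's manifests.
git log --oneline -- prod/my-service/

# Revert it; the in-cluster controller sees the new commit and converges
# the cluster back to the previous state.
git revert <bad-commit-sha>
git push origin main

# Note: no cluster credentials were needed at any point.
```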
107.2 Git as the source of truth
For GitOps to work, git has to hold the complete desired state of the cluster. Not code, not templates that get resolved somewhere else — the literal YAML (or Helm values, or Kustomize overlays that resolve to YAML) that the cluster should have. The discipline is “if it’s not in git, it doesn’t exist in production.”
The typical deploy repo layout:
deploy-repo/
├── apps/
│   ├── my-service/
│   │   ├── base/
│   │   │   ├── deployment.yaml
│   │   │   ├── service.yaml
│   │   │   └── kustomization.yaml
│   │   └── overlays/
│   │       ├── dev/
│   │       │   └── kustomization.yaml
│   │       ├── staging/
│   │       │   └── kustomization.yaml
│   │       └── prod/
│   │           ├── kustomization.yaml
│   │           └── hpa-patch.yaml
│   └── other-service/
│       └── ...
├── clusters/
│   ├── dev-us-east/
│   │   └── argocd-apps.yaml
│   ├── staging-us-east/
│   │   └── argocd-apps.yaml
│   └── prod-us-east/
│       └── argocd-apps.yaml
└── infrastructure/
    ├── argocd/
    ├── cert-manager/
    ├── external-dns/
    └── prometheus/
apps/ holds per-service manifests with a base/ and overlays/ split (Chapter 108 covers Kustomize). clusters/ holds cluster-specific configuration — which apps should be deployed to which clusters, with what overlays. infrastructure/ holds the platform components that every cluster needs.
The deploy repo is separate from the app repo (Chapter 106, §106.5). This separation matters because the two have different lifecycles: the app repo changes with every code commit, the deploy repo changes with every deployment event (including image-digest updates from the bot). Mixing them leads to noisy repos where every app commit triggers a cluster sync, which is wasteful and bad for rollback.
The rule is strict: no one runs kubectl apply against a GitOps-managed cluster. Every change goes through a PR to the deploy repo. Break-glass exceptions exist (rare, audited, usually documented via a runbook), but the default path is git.
107.3 Drift detection and self-healing
A GitOps controller continuously reconciles. The loop is:
- Fetch the git repo. Look at the manifests for the cluster’s assigned apps.
- Fetch the live cluster state via the Kubernetes API. List the resources that belong to each app.
- Diff. Which resources in git are missing from the cluster? Which resources on the cluster differ from what git says? Which resources exist on the cluster but not in git?
- Report the diff. Mark the app as Synced, OutOfSync, or Unknown.
- Optionally, auto-sync: apply the diff to make the cluster match git.
- Optionally, self-heal: if auto-sync is on and something drifts, the controller reverts the drift on the next reconciliation.
The reconciliation interval is typically 3 minutes by default; configurable down to seconds. On every loop, the controller either notices “no diff, all good” or “drift detected, applying fix” or “drift detected, alerting.” The cluster converges to git continuously.
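In ArgoCD, for instance, that interval lives in the argocd-cm ConfigMap. A sketch (180s is the shipped default; webhooks from the git host make syncs near-instant regardless of this value):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  # How often git is re-fetched and apps are re-reconciled.
  timeout.reconciliation: 180s
```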
Drift happens. Someone runs kubectl edit deployment my-service to debug a problem. An HPA changes the replica count (which is drift from the static value in git, and is a classic early GitOps trap — see §107.10). A sidecar injector mutates a pod spec. A Chaos Monkey deletes a pod. The controller has to decide: is this drift I should revert, or drift I should ignore?
The answer is declarative. Argo's ignoreDifferences field in the Application spec lets you say "ignore the replicas field on this Deployment, because the HPA owns it." Per-resource annotations such as argocd.argoproj.io/compare-options give similar control for individual objects. The point is that drift detection is a tunable policy, not a binary.
Self-healing is the most controversial GitOps feature. Turn it on and the cluster aggressively reverts every drift. Turn it off and you have to manually trigger a sync every time. The right setting depends on environment: dev clusters benefit from aggressive self-healing (drift is bugs), prod clusters often benefit from alerting-only (drift might be a human doing emergency debugging, and reverting mid-incident is the wrong move). Teams typically run self-heal on dev/staging and sync-on-commit on prod.
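In ArgoCD terms, that split is a one-field difference in the Application's syncPolicy. A sketch of the two postures:

```yaml
# dev: aggressive convergence -- drift is a bug, revert it immediately
syncPolicy:
  automated:
    prune: true
    selfHeal: true
---
# prod: sync on commit, but leave live-state drift alone (alert on it instead)
syncPolicy:
  automated:
    prune: true
    selfHeal: false
```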
107.4 ArgoCD architecture
ArgoCD is the most popular GitOps controller. It’s a few components running in a namespace on the cluster:
- argocd-repo-server: fetches git repos, resolves templates (Helm, Kustomize, Jsonnet), renders the desired manifests.
- argocd-application-controller: the reconciliation loop. Compares rendered manifests to live cluster state, applies diffs, manages sync waves.
- argocd-server: the API server behind the UI and CLI. Authenticates users, serves the web UI.
- argocd-dex-server: optional OIDC/SAML identity broker for SSO.
- argocd-redis: cache.
- argocd-notifications-controller: fires webhooks on sync events (Slack, PagerDuty).
The key resource is the Application CRD. An Application declares “here’s a git repo, here’s a path in it, here’s a cluster to deploy to, here’s the namespace, here are the sync options.”
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service-prod
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/deploy-repo.git
    targetRevision: main
    path: apps/my-service/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: my-service
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
    - CreateNamespace=true
    - PruneLast=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
  ignoreDifferences:
  - group: apps
    kind: Deployment
    jsonPointers:
    - /spec/replicas
Walking through:
- source is "where to find the manifests." A git repo, a revision, a path. The path points to a Kustomize overlay.
- destination is "where to deploy." The API server URL (https://kubernetes.default.svc means the local cluster) and the target namespace.
- syncPolicy.automated.prune: true — delete resources that exist on the cluster but not in git.
- selfHeal: true — revert drift automatically.
- ignoreDifferences — don't treat .spec.replicas changes as drift (because the HPA manages it).
Argo applies this declaratively. The Application itself lives in the argocd namespace; the managed resources it creates live in my-service. A cluster can have hundreds of Applications, each pointing to a different path in the same repo.
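The argocd CLI exposes the same reconciliation state for inspection. A few representative commands, using the Application name from the example above:

```shell
# Show sync and health status of the Application.
argocd app get my-service-prod

# Show the diff: rendered git manifests vs live cluster state.
argocd app diff my-service-prod

# Trigger a sync manually (only needed when automated sync is off).
argocd app sync my-service-prod
```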
107.5 Sync waves and hooks
When an Application has many resources, Argo applies them all at once by default. Sometimes order matters. A CRD has to be installed before the resource that uses the CRD. A Namespace has to exist before the resources inside it. A migration Job has to run before the new Deployment rolls out.
Argo’s argocd.argoproj.io/sync-wave annotation controls order. Resources with lower waves are applied first; Argo waits for each wave to be healthy before proceeding to the next.
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "-1"  # CRDs first

metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "0"   # normal resources (default)

metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "1"   # things that depend on the above
Wave -1 applies before wave 0; wave 1 applies after. Negative waves let you stage dependencies without renumbering everything else.
Sync hooks are the Kubernetes-native equivalent of Helm hooks. An annotation on a Job (or any resource) makes it a PreSync, Sync, PostSync, SyncFail, or PostDelete hook. Argo runs the hook at the corresponding phase of the sync.
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  template:
    spec:
      containers:
      - name: migrate
        image: my-service@sha256:abc...
        command: ["/my-service", "migrate"]
      restartPolicy: Never
This runs my-service migrate as a Job before the main sync. If it fails, the sync is aborted; the main Deployment is not rolled out. This is how you run database migrations safely in a GitOps model. Without hooks, you’d have to run migrations out-of-band, losing the git-as-source-of-truth property.
107.6 The App-of-Apps pattern
The basic model is “one Application per service.” For a cluster with 50 services, that’s 50 Application objects to manage. Who creates them? If they’re all in a file, someone has to kubectl apply them — which means someone has bootstrapping access to the cluster, which is the problem GitOps was supposed to solve.
The App-of-Apps pattern solves this. You create one “root” Application whose source points to a directory of Application YAMLs. Argo reconciles the root, which creates and manages all the other Applications. The root Application is the only one you have to bootstrap manually; everything else is managed by Argo itself.
graph TD
    Root[Root Application<br/>manual bootstrap] -->|manages| A1[my-service App]
    Root -->|manages| A2[billing-service App]
    Root -->|manages| A3[cert-manager App]
    Root -->|manages| A4[prometheus App]
    A1 -->|syncs| D1[my-service Deployment]
    A2 -->|syncs| D2[billing Deployment]
App-of-Apps means one manual kubectl apply bootstraps the cluster: the root Application self-manages every other Application, so adding a new service is a PR to the deploy repo, not a command against the cluster.
# root-app.yaml — the only thing applied manually to the cluster
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/deploy-repo.git
    targetRevision: main
    path: clusters/prod-us-east/apps
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
The path clusters/prod-us-east/apps contains Application YAMLs for every service that should run in this cluster:
clusters/prod-us-east/apps/
├── my-service.yaml
├── billing-service.yaml
├── payments-service.yaml
├── cert-manager.yaml
├── external-dns.yaml
└── prometheus.yaml
Adding a service to the cluster is a PR that adds an Application YAML to that cluster's apps/ directory. Argo's root sync picks it up, creates the Application, which triggers its own sync, which deploys the service. Removing a service is the reverse: delete the Application YAML; the root sync removes the Application; prune: true deletes the resources it managed.
The bootstrap is: after installing ArgoCD itself (via a Helm chart or manifest), apply the root Application once. That’s the one manual step. Everything else is git.
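That one manual step, sketched with the upstream install manifest (substitute a pinned version or Helm chart in practice):

```shell
# One-time bootstrap: install ArgoCD, then hand it the root Application.
kubectl create namespace argocd
kubectl apply -n argocd \
  -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
kubectl apply -n argocd -f root-app.yaml
# From here on, every change to the cluster is a commit to the deploy repo.
```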
107.7 ApplicationSets and multi-cluster fanout
App-of-Apps works for one cluster. For 20 clusters, you’d need 20 nearly identical Application YAMLs differing only in the cluster name. That’s templating pain. The ApplicationSet CRD is Argo’s answer: a single resource that generates many Applications based on a generator.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: my-service
  namespace: argocd
spec:
  generators:
  - clusters:
      selector:
        matchLabels:
          env: prod
  template:
    metadata:
      name: 'my-service-{{name}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/my-org/deploy-repo.git
        targetRevision: main
        path: 'apps/my-service/overlays/{{name}}'
      destination:
        server: '{{server}}'
        namespace: my-service
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
The clusters generator looks at registered Argo clusters with label env: prod. For each matching cluster, Argo generates one Application from the template, substituting {{name}} and {{server}}. Adding a new cluster with the prod label automatically generates an Application for it; removing the cluster removes the Application.
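The "registered clusters" the generator enumerates are ordinary Secrets in the argocd namespace carrying a well-known label. A sketch of a declarative registration (credentials elided; argocd cluster add creates the full version):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: prod-us-east
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster  # marks this Secret as a cluster registration
    env: prod                                # matched by the ApplicationSet's selector
stringData:
  name: prod-us-east
  server: https://prod-us-east.example.com:6443
  # config: (auth credentials, omitted here)
```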
Other generators exist:
- List: static list of parameters. Useful for "deploy to these three clusters."
- Git: generate from files or directories in a git repo. Point at apps/my-service/overlays/ and generate one Application per subdirectory (one per environment).
- Matrix: Cartesian product of two generators. "For each environment × each service, generate an Application."
- SCM: enumerate repositories/branches in a GitHub org. Less common.
- Pull Request: generate an Application per open PR. Perfect for per-PR preview environments.
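The git directory generator, sketched against the overlay layout from §107.2 (one Application per overlays/* subdirectory; the ApplicationSet name is illustrative):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: my-service-envs
  namespace: argocd
spec:
  generators:
  - git:
      repoURL: https://github.com/my-org/deploy-repo.git
      revision: main
      directories:
      - path: apps/my-service/overlays/*   # one entry per environment directory
  template:
    metadata:
      name: 'my-service-{{path.basename}}'  # my-service-dev, -staging, -prod
    spec:
      project: default
      source:
        repoURL: https://github.com/my-org/deploy-repo.git
        targetRevision: main
        path: '{{path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: my-service
```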
ApplicationSets are where GitOps starts to feel magical. “Deploy every service in the org to every dev cluster” is a single ApplicationSet with a matrix generator. Onboarding a new service is one PR that adds a path to the repo; onboarding a new cluster is one cluster registration. The fanout is automatic.
107.8 Flux and how it differs
Flux is the other mainstream GitOps controller, also CNCF graduated. The conceptual model is similar — pull from git, reconcile to cluster — but the details differ in ways that matter.
Flux is built from smaller, composable controllers. Each controller handles one thing: the source-controller fetches git repos and Helm charts; the kustomize-controller reconciles Kustomize overlays; the helm-controller reconciles HelmReleases; the notification-controller sends alerts; the image-automation-controller updates image tags in git when new images are pushed. You install the controllers you need.
The CRDs are correspondingly granular. A GitRepository points at a git URL. A Kustomization points at a path in a GitRepository and reconciles it. A HelmRelease points at a chart. The user-facing model is a bit more verbose than ArgoCD’s Application but composes better — you can mix and match sources, paths, and reconciliation policies freely.
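Sketched for the same deploy repo (resource names are illustrative): a GitRepository that source-controller polls, and a Kustomization that kustomize-controller reconciles from it.

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: deploy-repo
  namespace: flux-system
spec:
  url: https://github.com/my-org/deploy-repo.git
  ref:
    branch: main
  interval: 1m            # how often source-controller re-fetches
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: my-service-prod
  namespace: flux-system
spec:
  sourceRef:
    kind: GitRepository
    name: deploy-repo
  path: ./apps/my-service/overlays/prod
  prune: true             # Flux's equivalent of ArgoCD's prune
  interval: 10m           # reconciliation interval
```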
ArgoCD has a polished UI; Flux does not (there’s a separate weave-gitops UI that adds one). ArgoCD’s CLI and web UI are the main user interfaces; Flux is more CLI-and-manifest-centric. For teams that want a clickable dashboard of every app across every cluster, ArgoCD is the easier sell. For teams that want a minimal controller set that integrates cleanly with other Kubernetes operators, Flux is cleaner.
Flux’s image automation is a differentiator: built-in support for watching a registry for new tags and opening a PR against the deploy repo to update the manifest. This replaces the bespoke bot described in Chapter 106 for many teams. ArgoCD has a similar capability (Argo CD Image Updater) but it’s an add-on, not a core component.
For a new deployment in 2026, both are good choices. The honest heuristic: ArgoCD if you want the UI and the larger ecosystem; Flux if you want the more modular architecture and tight image automation. Don’t mix them in the same cluster — they step on each other and the debugging is painful.
107.9 Progressive delivery with Argo Rollouts / Flagger
Once GitOps is wired up, the next question is: how do you deploy safely? A plain Deployment rollout replaces pods in rolling order. For a low-risk change this is fine; for a model upgrade or a new feature with uncertain behavior, you want progressive delivery — shift a small fraction of traffic to the new version, observe metrics, proceed or roll back.
Argo Rollouts is Argo’s progressive delivery controller. It replaces the Kubernetes Deployment kind with a Rollout kind that supports canary and blue/green strategies, driven by metrics from Prometheus or other sources.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service
spec:
  replicas: 10
  strategy:
    canary:
      steps:
      - setWeight: 10          # 10% traffic to new version
      - pause: {duration: 5m}
      - analysis:              # run a metric analysis
          templates:
          - templateName: error-rate
      - setWeight: 30
      - pause: {duration: 10m}
      - analysis: {templates: [{templateName: error-rate}]}
      - setWeight: 60
      - pause: {duration: 10m}
      - setWeight: 100
  template:
    # ... same as a Deployment pod template
The AnalysisTemplate defines a PromQL query and a success condition (“error rate below 1% for 5 minutes”). If analysis fails at any step, the rollout automatically aborts and rolls back to the previous version. No human intervention, no pager, no 3 AM rollback call.
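A sketch of the error-rate AnalysisTemplate referenced above (the metric names and Prometheus address are assumptions about the environment):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate
spec:
  metrics:
  - name: error-rate
    interval: 1m
    failureLimit: 3                      # abort + roll back after 3 failed measurements
    successCondition: result[0] < 0.01   # error rate below 1%
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          sum(rate(http_requests_total{service="my-service",code=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="my-service"}[5m]))
```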
Flagger is the Flux equivalent, with similar capabilities. Both work by manipulating the underlying service mesh or ingress to shift traffic between old and new pods.
Progressive delivery matters most for ML model releases (Chapter 98’s canary patterns). A new model deploying to 5% of traffic with auto-rollback on latency spikes is dramatically safer than a full rollout. Chapter 98 went deep on the ML-specific side; Argo Rollouts and Flagger are the platform primitives underneath.
107.10 GitOps anti-patterns
A few failure modes worth naming explicitly.
Secrets in git. Plaintext secrets can never be in git, and even encrypted ones are a risk if the threat model includes "someone reads the git repo and later obtains the key." The solutions are sealed-secrets (encrypt against a cluster-specific key), External Secrets Operator (fetch from Vault/1Password/AWS Secrets Manager at sync time), or SOPS (encrypt inline with a KMS key). Chapter 111 covers this in depth. Never put plaintext secrets in a GitOps repo, even a private one.
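For flavor, the sealed-secrets flow, sketched (assumes the kubeseal CLI and a sealed-secrets controller in the cluster; names are illustrative):

```shell
# 1. Create a plain Secret locally -- never committed.
kubectl create secret generic db-creds --from-literal=password=hunter2 \
  --dry-run=client -o yaml > secret.yaml

# 2. Encrypt against the cluster's public key; only that cluster can decrypt.
kubeseal --format yaml < secret.yaml > sealed-secret.yaml

# 3. Commit sealed-secret.yaml to the deploy repo; delete the plaintext.
rm secret.yaml
```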
Replica-count in git fighting the HPA. replicas: 3 in git and an HPA that wants 20 means Argo constantly reverts the HPA. The fix is ignoreDifferences on the .spec.replicas field, or removing replicas from the manifest entirely (let the HPA manage it exclusively). This is the single most common early GitOps bug.
Over-use of self-healing in prod. Aggressively reverting drift during an incident is the wrong move; sometimes drift is a human doing emergency debugging. Put self-heal in dev, sync-on-commit in prod. Alert on drift in prod; don’t auto-revert.
Gigantic repo. One deploy repo for 500 services becomes a bottleneck for CI, diffs, and mental model. Split into per-team repos with a top-level meta-repo if the scale warrants it. Or keep one repo but use CODEOWNERS to enforce per-team review paths.
Templates-of-templates-of-templates. Kustomize overlays on Helm charts on Jsonnet templates produces a situation where nobody knows what actually gets applied. Pick one templating tool per repo and stick with it. Chapter 108 covers the tradeoffs.
Bypassing GitOps with kubectl apply. A runbook that says “run kubectl edit deployment” is admitting defeat. Every change should be a PR. If a change is too urgent for a PR, fix the PR pipeline to be faster; don’t route around it.
No rollback plan. GitOps rollback is git revert, but some changes aren’t safely revertible (database schema changes, irreversible migrations). Mark those changes, plan them carefully, and understand that “rollback via git” is not a universal escape hatch.
107.11 The mental model
Eight points to take into Chapter 108:
- Pull-based deployment beats push-based. No credentials leave the cluster; the CI system ends at git-push.
- Git is the source of truth. Everything is a PR; nothing is kubectl apply in production.
- A controller reconciles cluster state to git continuously. Drift is detected and (optionally) healed.
- ArgoCD’s Application is the unit. App-of-Apps bootstraps an entire cluster from one root Application.
- ApplicationSets fan out to many clusters or environments. Generators declare the matrix.
- Sync waves and hooks order dependencies, run migrations as PreSync jobs.
- Argo Rollouts / Flagger add progressive delivery on top of GitOps. Metric-driven canaries and auto-rollback.
- Anti-patterns matter. Secrets in git, replica-count conflicts, self-heal in prod, and kubectl apply break-glass are the landmines.
Chapter 108 picks up the templating question that this chapter keeps touching — Helm, Kustomize, CDK8s — and compares them honestly. The second half of Part IX (Chapters 108-111) is about templating, multi-cluster architecture, IaC, and secrets management.
Read it yourself
- Argo CD documentation at argo-cd.readthedocs.io. Start with “Getting Started” and “Declarative Setup.”
- Flux documentation at fluxcd.io/flux/. The “Concepts” section is the clearest intro to the controller architecture.
- Weaveworks’ Guide to GitOps (2017 blog post, rewritten several times). The origin of the term and the core principles.
- The OpenGitOps principles at opengitops.dev. A short formal definition of what GitOps means as of the CNCF working group.
- Viktor Farcic, The DevOps Toolkit: Catalog, Patterns, and Blueprints, chapters on GitOps. Practical recipes.
- Argo Rollouts documentation, particularly “Canary strategy” and “Analysis.” Concrete examples of progressive delivery configs.
Practice
- Bootstrap a local k3d cluster with ArgoCD. Apply an App-of-Apps root Application pointing at a public example repo. Observe the reconciliation in the UI.
- Create an Application for a trivial service (nginx pointed at a ConfigMap-mounted HTML page). Make a change in git and watch Argo sync it.
- Explain in your own words why secrets cannot be in a GitOps repo, even if the repo is private.
- Write an Application spec that deploys the same service to three environments (dev, staging, prod) using three Kustomize overlays. Use an ApplicationSet with a git generator.
- Configure an Argo Rollout with a 25% → 50% → 100% canary and a PromQL analysis that checks error rate. What happens if the analysis fails mid-rollout?
- A team wants kubectl edit in prod for emergencies. How would you support that without destroying the GitOps model?
- Stretch: Simulate drift in a GitOps-managed cluster by manually editing a Deployment with kubectl edit. Watch what Argo does with self-heal on vs off. Write a short postmortem-style writeup of what you observed.