Identity propagation across service boundaries
"The question is not whether the caller is authenticated. The question is: on whose behalf?"
Chapter 74 treated a request as a pairwise interaction between a client and a server. Real production systems are rarely that simple. A user’s request lands on a gateway, which calls an orchestrator, which calls a model service, which calls a vector store, which calls a feature store. Five hops. At each hop, the receiving service has to answer two questions: which service is calling me, and on whose behalf?
This chapter is about the second question. The first question is Chapter 74’s territory (mTLS, client credentials, service identities). The second is harder because the user was only present at hop zero; their identity has to propagate through the graph in a way that every downstream service can trust without the user ever touching it directly. Getting this wrong leaks data across tenants, lets one user act as another, or forces every service to call back to the identity provider. Getting it right is one of the quiet arts of senior platform engineering.
Outline:
- Why identity propagation is a problem.
- The naive approach and why it fails.
- Token passthrough.
- Token exchange.
- Internal JWTs and the issuer/audience claim.
- The “trust upstream” pattern and its dangers.
- Principal context propagation across RPC hops.
- Multi-identity requests: user + service + tenant.
- Audit trails and forensics.
- Failure modes.
- The mental model.
75.1 Why identity propagation is a problem
Consider a concrete flow in an ML platform:
user → gateway → assistant-svc → llm-router → vllm-pod → vector-db
                      └─→ retrieval-svc ──────────────────┘
At each hop, the downstream service has to enforce policy: the vector store has to scope queries to the user’s tenant, the LLM router has to apply the user’s rate limit and billing account, the assistant has to log actions under the user’s audit identity. If the user’s identity stops at the gateway and every downstream service sees only “a call from assistant-svc,” then policy enforcement collapses: either assistant-svc becomes a god-mode service that sees everything (and every bug is a privilege escalation), or downstream services have no way to distinguish one user from another.
The naive fix — “just forward the user’s original token to every downstream service” — has several failure modes that make it unsafe in practice. The right answer is a combination of token exchange, audience-scoped internal tokens, and principal context propagated through RPC metadata. This chapter walks through each.
The deeper reason identity propagation is tricky: trust is not transitive by default. Service A can authenticate the user. Service B, downstream of A, cannot assume “because A verified the user, I can trust A when it tells me who the user is” — unless there is a cryptographic reason to do so. The reason has to be constructed explicitly. The techniques in this chapter are all ways to construct that reason.
75.2 The naive approach and why it fails
The simplest approach: the user’s JWT is forwarded as-is to every downstream service. Every service verifies it. The identity flows for free.
The failures:
Audience mismatch. The user’s token was minted with aud: "gateway.example.com". Service B is not the audience. If service B accepts the token anyway, it is vulnerable to a confused deputy attack: a token meant for one service is used against another. If service B rejects the token (the correct behavior), the naive forwarding does not work.
Scope bloat. The user’s token has scopes intended for the public API — chat:write, embeddings:read. Downstream services may need different scopes, or no scopes at all. The user’s token is either over-scoped for the downstream (giving too much access) or under-scoped (missing permissions the service needs for internal work). You cannot have one token that is right for every layer.
Token leakage. Every service the request touches sees the raw user token. If any of them logs request headers or has a debug endpoint that echoes them, the token leaks. One compromised service steals tokens for every user who ever touched it. The blast radius is catastrophic.
Refresh and rotation. The user’s token expires in 5 minutes. A long-running background job kicked off by the user’s request tries to call downstream 10 minutes later. The token is expired. Naive forwarding has no recovery.
No service identity. Downstream service B sees the user’s token but has no way to know that the call came via service A versus via a malicious direct call. No way to apply policies like “service B only accepts user tokens via service A,” because the request looks identical either way.
These failures are why production systems always add a layer of indirection: they exchange the user’s token for an internal token, or they wrap the user’s identity in a service-signed envelope. The next three sections are those patterns.
75.3 Token passthrough
The simplest mitigation is token passthrough with scoping. The user’s original token is forwarded, but downstream services verify it strictly: signature, expiration, and crucially the audience. The authorization server mints tokens with an aud that is a list of intended recipients (e.g., aud: ["gateway", "assistant-svc"]), and each service validates that its own identifier is in the list.
This is the pattern used by many smaller systems. It avoids the complexity of token exchange but inherits most of the failures in §75.2 — token leakage especially. The scoping helps, but the root token is still everywhere.
A partial mitigation: pass the token only to services that must see the user’s raw authorization (e.g., a BFF that makes user-scoped calls to external APIs). Further-downstream services see only an internal representation of the identity, not the original token. This shrinks the blast radius.
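The strict audience validation above boils down to one check, applied after the signature has been verified. A minimal sketch (per RFC 7519 the aud claim may be a single string or an array; check_audience and the service identifiers are illustrative names, not from the original):

```python
def check_audience(claims: dict, my_service_id: str) -> bool:
    """Accept a token only if this service appears in its aud claim."""
    aud = claims.get("aud")
    if aud is None:
        return False  # fail closed: a token with no audience is rejected
    audiences = aud if isinstance(aud, list) else [aud]
    return my_service_id in audiences
```

A token minted for `["gateway", "assistant-svc"]` is rejected by the vector store even if its signature is valid — exactly the confused-deputy case §75.2 describes.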
When passthrough is acceptable:
- Small platforms with few services.
- Systems where the user’s token is short-lived (minutes) and the call chain is short.
- Cases where downstream services already need the user’s token for external API calls and cannot derive it.
When it is not acceptable:
- Any platform with more than a handful of services.
- Long-running background work kicked off by a user request.
- Systems where one downstream service has a different security posture than the rest (e.g., a sandbox, a third-party integration).
- Anywhere the blast radius of a single compromised service matters.
For those cases, the answer is token exchange.
75.4 Token exchange
RFC 8693: OAuth 2.0 Token Exchange is the standard. The idea: a service holding a subject token (the user’s original) can call the authorization server’s /token endpoint with grant_type=urn:ietf:params:oauth:grant-type:token-exchange and get back a new token that represents the same subject but is scoped for a specific downstream audience with a specific set of scopes and a shorter lifetime.
sequenceDiagram
participant Client
participant Gateway
participant AuthServer
participant LLMRouter
Client->>Gateway: POST /v1/chat (ext_token)
Gateway->>AuthServer: POST /oauth2/token (exchange ext_token → aud=llm-router)
AuthServer-->>Gateway: int_token {sub=user_42, aud=llm-router, scopes=[chat:write], exp=+5min}
Gateway->>LLMRouter: forward (int_token)
LLMRouter->>LLMRouter: verify sig, aud, exp
LLMRouter-->>Gateway: response
Gateway-->>Client: response
Token exchange (RFC 8693) adds one round-trip to the auth server per outbound audience; caching exchanged tokens for their lifetime keeps this overhead to milliseconds.
The concrete request (simplified):
POST /oauth2/token
Content-Type: application/x-www-form-urlencoded
Authorization: Basic <base64(client_id:client_secret)>
grant_type=urn:ietf:params:oauth:grant-type:token-exchange
&subject_token=<user's original token>
&subject_token_type=urn:ietf:params:oauth:token-type:jwt
&audience=llm-router
&scope=chat:write
&requested_token_type=urn:ietf:params:oauth:token-type:access_token
The response is a new JWT whose claims represent the user (same sub) but whose audience is llm-router and whose scope is tightly restricted. The gateway or service that performs the exchange is the actor — the new token includes an act claim that records who did the exchange on behalf of whom, so audit trails can reconstruct the chain.
The benefit: every downstream service gets a token that is just for it, with just the scopes it needs, with a short lifetime that expires quickly if intercepted. The user’s original token never leaves the gateway. A compromised downstream service leaks only its own narrow token, not the user’s god-token.
The cost: every hop that requires a new audience is a call to the auth server. This is the classic latency/security tradeoff. Mitigations:
- Exchange once at the edge. The gateway exchanges the user’s token for an internal token on ingress, and the internal token is used for all downstream calls. Only one extra round trip per request.
- Cache exchanged tokens. If the same subject and audience pair is seen repeatedly, cache the result until near expiration. Short cache TTLs (seconds to a minute) strike the right balance.
- Use shorter call chains. Flattening the architecture reduces the number of exchanges.
- Batch exchanges. Some auth servers accept batch exchanges in one call.
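The caching mitigation can be sketched as a small wrapper keyed by (subject, audience). The exchange_fn callable is an assumption for illustration: in a real system it would POST the RFC 8693 grant to the auth server and return the new token plus its expiry.

```python
import time

class ExchangeCache:
    """Cache exchanged tokens per (subject, audience); refresh shortly before expiry."""

    def __init__(self, exchange_fn, early_refresh_s=30):
        # exchange_fn(subject_token, audience) -> (internal_token, exp_unix)
        self._exchange = exchange_fn
        self._early = early_refresh_s
        self._cache = {}

    def get(self, subject, subject_token, audience):
        key = (subject, audience)
        hit = self._cache.get(key)
        if hit is not None and hit[1] - self._early > time.time():
            return hit[0]  # cached token is still comfortably inside its lifetime
        token, exp = self._exchange(subject_token, audience)
        self._cache[key] = (token, exp)
        return token
```

Refreshing early_refresh_s seconds before expiry avoids handing a downstream service a token that dies mid-flight.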
Token exchange is the “real” answer for any platform beyond a handful of services. Every major identity provider (Auth0, Okta, Keycloak) supports it. If yours does not, add it before the architecture grows further.
75.5 Internal JWTs and the issuer/audience claim
A variant of token exchange: the gateway mints an internal JWT directly, without involving an external auth server. The gateway has its own signing key (or a mesh-provided identity) and acts as the issuer for internal tokens. The internal JWT’s claims:
{
"iss": "gateway.internal",
"aud": "llm-router",
"sub": "user_42",
"tenant": "acme",
"act": {"svc": "gateway", "svc_ver": "1.18.3"},
"scopes": ["chat:write"],
"exp": 1728000000,
"iat": 1727999700,
"jti": "f8b2..."
}
The internal JWT has all the properties of a regular JWT: signed, audience-scoped, short-lived. But it is minted locally without a round trip to an external auth server, which makes it cheap enough to mint per-hop. A deeper chain can mint a new internal token at each hop, each with the next service’s audience, chaining the act claim to record the call path.
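Minting with a chained act claim can be sketched with the standard library alone. This is illustrative only: it uses HS256 with a shared key for brevity, where a production gateway would use an asymmetric algorithm via a JWT library; the claim names mirror the example above.

```python
import base64, hashlib, hmac, json, time

def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def mint_internal_jwt(key, sub, tenant, aud, scopes, svc, prev_act=None):
    """Mint a short-lived internal JWT, nesting the previous act claim if present."""
    now = int(time.time())
    claims = {
        "iss": "gateway.internal", "aud": aud, "sub": sub, "tenant": tenant,
        "scopes": scopes, "iat": now, "exp": now + 300,  # 5-minute lifetime
        # nest the previous act to build the delegation chain
        "act": {"svc": svc, **({"act": prev_act} if prev_act else {})},
    }
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(claims).encode())
    sig = _b64url(hmac.new(key, f"{header}.{payload}".encode(), hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"
```

A token minted by assistant-svc on top of a gateway-minted one carries act = {svc: assistant-svc, act: {svc: gateway}}: the linked list of delegations.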
Key rules:
- Internal JWTs must have a different issuer than user-facing tokens. Downstream services must reject a user token as an internal token and vice versa, because they have different trust assumptions.
- The audience is mandatory. Every internal JWT is for exactly one target service. No wildcards.
- The signing key is distinct from the user-facing auth server’s key. Compromising the internal signing key should not let an attacker mint user-facing tokens.
- Short lifetimes. Minutes at most. These tokens live inside the platform and never need to outlive a single request chain.
- The act claim chains. Each hop that re-mints an internal token nests the previous act inside the new one, building a linked list of delegations. This is how audit logs reconstruct “who called whom on behalf of whom.”
Internal JWTs are the bread and butter of large microservice platforms because they are cheap, local, verifiable offline, and cleanly auditable. The downside is that you are now running your own mini-PKI: key rotation, JWKS publication, signature algorithm upgrades. If your mesh already provides service identities (SPIFFE), you can often get away with using SVIDs (SPIFFE Verifiable Identity Documents) as the signing identity and skipping a separate key store.
75.6 The “trust upstream” pattern and its dangers
Some platforms take a shortcut: instead of verifying JWTs at every hop, downstream services simply trust that any call arriving via an internal network already went through the gateway and that headers like X-User-Id and X-Tenant are authoritative. The gateway sets these headers after verifying the user’s token; downstream services consume them directly.
This pattern is common because it is fast (no signature verification) and simple (just read a header). It also has real failure modes.
Perimeter erosion. The assumption “calls inside the cluster came through the gateway” is only true if the network is fully segmented. One service exposed by mistake — a debug port, an Ingress pointing to the wrong target, a pod with hostNetwork — and an attacker can call downstream services directly with arbitrary X-User-Id headers and impersonate anyone. Zero-trust networking exists because perimeter trust collapses under any mistake.
Compromised sidecars. A compromised pod inside the mesh can set any headers it wants. mTLS only tells you which pod is calling; it does not tell you what the pod’s headers mean. If the headers are trusted by policy, any pod can act as any user.
No audit integrity. With trust-upstream headers, there is no cryptographic proof that the gateway actually set the header. Audit logs become unverifiable after the fact.
The pattern is acceptable under strict conditions:
- mTLS is enforced between every service. Downstream services only accept calls from services whose mTLS identity is on an allowlist (e.g., only gateway.svc.cluster.local).
- The gateway is the only entrypoint, verified by network policies (NetworkPolicy in Kubernetes, or the equivalent).
- There is no back door — no debug endpoints, no admin interfaces bypassing the gateway.
- Headers are stripped on ingress. The gateway must strip any incoming X-User-Id header from the client and only set its own version. Missing this is a classic bug: a malicious client sends X-User-Id: admin and the gateway forwards it.
Even with all four, trust-upstream is brittle. The pattern survives because it is fast and because in many platforms the blast radius of a mistake is bounded. The right long-term move is to upgrade to signed internal JWTs (§75.5), which give you both speed (local verification) and cryptographic integrity.
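The header-stripping rule is mechanical and easy to get right once, in one place. A sketch, with the header names and helper functions as illustrative assumptions:

```python
# Identity headers only the gateway may set; anything inbound with these names is hostile.
IDENTITY_HEADERS = frozenset({"x-user-id", "x-tenant", "x-scopes"})

def strip_identity_headers(inbound: dict) -> dict:
    """Drop client-supplied identity headers before doing anything else."""
    return {k: v for k, v in inbound.items() if k.lower() not in IDENTITY_HEADERS}

def forward_headers(inbound: dict, subject: str, tenant: str) -> dict:
    """Build the downstream header set: sanitized inbound plus gateway-set identity."""
    out = strip_identity_headers(inbound)
    out["X-User-Id"] = subject  # set fresh from the verified token, never copied through
    out["X-Tenant"] = tenant
    return out
```

Note the case-insensitive comparison: a client sending X-USER-ID must be caught too.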
When interviewers ask about this pattern, the right answer is: “It can work under strict network and enforcement assumptions, but it is a trust-propagation shortcut that collapses when any of those assumptions break. Signed internal tokens are the safer default.”
75.7 Principal context propagation across RPC hops
The identity is in a JWT; how does the handler code see it? Two mechanisms, used together.
HTTP/gRPC headers carry the token on the wire. Authorization: Bearer <token> for HTTP, metadata for gRPC. Every hop passes them through; every hop verifies them.
Request-scoped context carries the parsed identity inside the handling service. In Go, context.Context is the idiom; in Python, contextvars; in Java, thread-locals or the Spring SecurityContext; in Node, AsyncLocalStorage. The middleware at request ingress verifies the token, parses the claims, builds a Principal struct, and stores it in the context. Handler code retrieves it via ctx.Principal() rather than re-parsing the header.
# Pseudocode for a Python/FastAPI-style middleware
async def auth_middleware(request, call_next):
    auth = request.headers.get("authorization", "")
    if not auth.startswith("Bearer "):
        raise HTTPException(401)  # fail closed on a missing or malformed header
    token = auth.removeprefix("Bearer ")
    # verify_jwt raises on a bad signature, wrong audience, or expired token
    claims = verify_jwt(token, expected_audience="my-service")
    principal = Principal(
        subject=claims["sub"],
        tenant=claims.get("tenant"),
        scopes=claims.get("scopes", []),
        act=claims.get("act"),
        raw_token=token,  # keep for downstream exchange
    )
    request.state.principal = principal
    return await call_next(request)
Handler code:
async def chat_endpoint(request: Request):
    p = request.state.principal
    if "chat:write" not in p.scopes:
        raise HTTPException(403)
    # Call downstream with a new audience
    downstream_token = await token_exchange(
        subject_token=p.raw_token,
        audience="llm-router",
        scopes=["chat:write"],
    )
    return await llm_router_client.generate(
        request_body,
        headers={"Authorization": f"Bearer {downstream_token}"},
    )
Two invariants matter:
- Never trust claims from outside the middleware. Handlers never parse raw tokens; they only read the Principal. This is enforced by code review and, better, by linters that forbid reading Authorization headers outside the middleware file.
- Propagate deadline, trace context, and principal in the same context object. All three flow through every call. OpenTelemetry’s propagation spec handles the trace context; language conventions handle the rest.
The principal also includes the actor chain — who is calling on behalf of whom. When a background worker kicks off work for user 42, the actor is the worker, the subject is user 42, and policies can decide whether “worker-svc calling as user 42” is acceptable for the action in question. Services like Google’s Zanzibar distinguish these explicitly in their ACL model. A platform that flattens them loses the ability to write policies like “only workflow-svc can act as a user for long-running jobs.”
75.8 Multi-identity requests: user + service + tenant
A single request carries multiple identities at once:
- User — the human (or API key) that initiated the call.
- Service — the calling service (known via mTLS / SPIFFE).
- Tenant — the organization the user belongs to, which scopes data access.
- Actor — if the request is delegated (e.g., a worker running on behalf of a user), the actor is the service that initiated the delegation.
Each plays a distinct role in authorization decisions. A naive system flattens these (“the user is user 42”) and loses the ability to write policies like “user 42 can only read tenant A’s data when called through the dashboard, not through the CLI.” The full request context looks like:
Principal {
    subject: "user_42"         # who
    tenant:  "acme"            # scope
    caller:  "gateway.prod"    # which service forwarded this call (from mTLS)
    actor:   {svc: "gateway"}  # who delegated (from JWT act claim)
    scopes:  ["chat:write"]    # what they can do
}
Authorization checks can then use any combination: allow if subject in tenant.members AND caller == "gateway.prod" AND "chat:write" in scopes. Policies live in a PDP (Chapter 74) that understands all four identities.
Tenant scoping is the one that is easiest to get wrong in multi-tenant ML platforms. Every database query, every vector search, every cache lookup must be scoped by tenant. The enforcement should be at the data layer, not at the handler layer, because handler-layer enforcement is one if statement away from being forgotten. Row-level security in Postgres, tenant-prefixed keys in Redis, per-tenant vector collections in Qdrant/Weaviate — pick the mechanism that cannot be forgotten. If a developer writes SELECT * FROM documents WHERE id = ? without a tenant filter, the bug should fail closed, not open.
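Fail-closed enforcement at the data layer can be as small as a wrapper that constructs every key itself. A sketch over an in-memory dict (the class and error names are illustrative; a Redis client would slot in the same way):

```python
class TenantScopeError(RuntimeError):
    """Raised when data access is attempted without a tenant context."""

class ScopedStore:
    """Key-value access where every key is tenant-prefixed; no tenant, no access."""

    def __init__(self, backend):
        self._backend = backend  # any dict-like store

    def _key(self, tenant, key):
        if not tenant:
            raise TenantScopeError("refusing unscoped access")  # fail closed
        return f"{tenant}:{key}"

    def get(self, tenant, key):
        return self._backend.get(self._key(tenant, key))

    def put(self, tenant, key, value):
        self._backend[self._key(tenant, key)] = value
```

Because the prefix is built inside the wrapper, a handler that forgets the tenant cannot silently read another tenant’s rows; it gets an exception instead.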
75.9 Audit trails and forensics
Identity propagation pays off in an incident. Someone reports a data leak. The question is: which user, via which services, accessed what resource, and when? The answer requires two things.
Every AuthZ decision is logged. Timestamp, subject, actor chain, caller service, action, resource, allow/deny, reason. These logs go to a dedicated store (typically an append-only log like Kafka into an OLAP store, not the general application logs). They are retained for the period required by compliance (90 days minimum, often years).
The chain is reconstructible. Given one log entry, you can follow it backwards through the service graph. This requires trace IDs that propagate along with identity (OpenTelemetry), and it requires that every service emits an audit log entry keyed by the same trace ID. A gateway audit log entry, an assistant-svc audit log entry, an llm-router audit log entry, and a vector-db audit log entry all with the same trace ID tell the full story.
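One way to keep records uniform across services is a single record-building helper that every service calls. The field set below mirrors the fields named above and is otherwise illustrative:

```python
import json, time

def audit_record(trace_id, subject, actor_chain, caller, action, resource,
                 decision, reason):
    """Serialize one authz decision as an append-only log line, keyed by trace_id."""
    return json.dumps({
        "ts": time.time(),
        "trace_id": trace_id,   # joins entries across the service graph
        "sub": subject,
        "act": actor_chain,     # full delegation chain from the token
        "caller": caller,       # mTLS identity of the calling service
        "action": action,
        "resource": resource,
        "decision": decision,   # "allow" or "deny"
        "reason": reason,
    }, sort_keys=True)
```

Each service writes these lines to the dedicated audit stream; a query by trace_id then returns the full chain in one pass.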
The act claim from §75.5 gives you the intended delegation chain at token-mint time. The trace ID gives you the actual call path at runtime. Both are useful: when they diverge, you have a bug. A request that was supposed to be scoped to assistant-svc → llm-router but actually reached vector-db-admin is suspicious even if no policy was violated.
Build this before you need it. After an incident is a bad time to discover that your audit logs do not have enough context to answer questions.
75.10 Failure modes
A tour of the ways identity propagation goes wrong in practice.
JWT passed forward with no audience check. Service B accepts service A’s token because “we trust the cluster.” One misconfigured service is enough to impersonate any user. Always validate aud.
Expired token triggers chained re-login. A user’s 5-minute token expires while their long-running request is still traversing backends. Downstream calls start failing with 401. The user sees an error for a request they already submitted. Fix: token exchange at the edge for background work, with longer-lived internal tokens.
User’s token leaked in logs. A service logs request.headers for debugging. Every user’s Authorization header ends up in Splunk. Fix: never log raw headers. Sanitize in the logging middleware.
act claim dropped. The first service propagates the identity; subsequent services re-mint without the act chain. The audit trail loses fidelity. Fix: mandatory act propagation in the exchange helper, verified by tests.
Trust-upstream headers not stripped. A client sends X-User-Id: admin, the gateway passes it through, downstream trusts it. Catastrophic. Fix: gateway must strip any inbound X-User-* headers before setting its own.
Per-hop exchange too slow. Every hop calls the auth server, adding 20 ms. A 5-hop request is now 100 ms of auth overhead. Fix: exchange once at the edge, use internal JWTs for further hops.
Tenant claim missing on service-initiated calls. A scheduled job calls vector-db without a tenant context. The query returns all tenants’ data. Fix: even service-initiated calls must carry an explicit tenant context, either in the token or as a deliberate “cross-tenant” claim with its own audit signal.
JWKS cache hot-reload fails. The auth server rotates keys. Services cache the old JWKS for hours. Verification starts failing for new tokens. Fix: cache with a TTL, refresh on unknown kid, warn on stale JWKS.
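The refresh-on-unknown-kid fix can be sketched as a small cache. The injected fetch_fn is an assumption for illustration; in practice it would GET the issuer’s JWKS URL and parse the key set.

```python
import time

class JWKSCache:
    """Cache verification keys by kid; refresh when a kid is unknown or the TTL expires."""

    def __init__(self, fetch_fn, ttl_s=300):
        self._fetch = fetch_fn  # callable() -> {kid: key}
        self._ttl = ttl_s
        self._keys = {}
        self._fetched_at = 0.0

    def key_for(self, kid):
        stale = time.time() - self._fetched_at > self._ttl
        if stale or kid not in self._keys:
            self._keys = self._fetch()  # refresh on unknown kid or stale cache
            self._fetched_at = time.time()
        return self._keys.get(kid)  # None means the kid is genuinely unknown
```

An unknown kid triggers exactly one refetch, so a key rotation is picked up on the first token signed with the new key rather than hours later.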
Scope creep in internal tokens. Internal tokens accrete scopes over time as new features want them. Eventually internal-full-access becomes the norm. Fix: audit scope grants in PRs; require justification for anything beyond least-privilege.
Failure to propagate revocation. A user is disabled. Existing tokens still work. Short access lifetimes help. For long-lived background work, store a version number in the token and check it against a fast cache.
75.11 The mental model
Eight points to take into Chapter 76:
- Identity must propagate through the call graph, not stop at the gateway. Downstream services need the original identity to enforce tenant-scoped policy.
- Naive passthrough fails on audience, scope, token leakage, and refresh. Do not forward user tokens blindly across services.
- Token exchange (RFC 8693) is the standards-based answer: mint a new token for each audience with just the scopes needed.
- Internal JWTs are the local optimization of token exchange: the gateway signs a new short-lived token per-hop with a different issuer than user-facing tokens.
- The act claim records delegation chains: “service A acting on behalf of user 42.” Needed for correct audit trails.
- Trust-upstream headers are fast and brittle. Acceptable only with strict mTLS, network segmentation, and header stripping — and even then, signed internal JWTs are safer.
- A request carries multiple identities: user, tenant, calling service, actor. Authorization policies use all four.
- Audit logs must reconstruct the chain. Every authZ decision, keyed by trace ID and including the full actor chain, stored durably before you need it.
In Chapter 76 the focus shifts from who is calling to how much they can call: rate limiting algorithms, from token bucket to GCRA, with a real Redis Lua script you would ship.
Read it yourself
- RFC 8693 — OAuth 2.0 Token Exchange. The standard for exchanging one token for another.
- SPIFFE specification (spiffe.io), particularly SVIDs and the workload API.
- Google’s BeyondCorp papers (and follow-ups) — the zero-trust model that argues against perimeter trust.
- Envoy Proxy’s external authorization filter documentation.
- The OPA documentation on policies that use identity attributes.
- Scott Piper’s “flaws.cloud” exercises — real-world examples of AWS misconfigurations that boil down to identity propagation bugs.
Practice
- Draw the identity propagation flow for a 4-service call chain. At each hop, mark which token is in the Authorization header and which identity is in the handler’s context.
- Explain why forwarding the user’s raw JWT to every downstream service is unsafe. Give at least three distinct failure modes.
- Write the claims for an internal JWT minted by a gateway for a call to llm-router on behalf of user_42 in tenant acme with scope chat:write. Include a plausible act chain.
- Why is the aud claim mandatory for internal tokens? Construct a confused-deputy attack that would succeed without it.
- The trust-upstream pattern is fast. Give three assumptions that must hold for it to be safe, and describe what happens when each one breaks.
- A scheduled batch job runs at 2 AM, processing queued work submitted by user 42 at 5 PM. The user’s access token lived 5 minutes. How does the batch job prove it is acting on behalf of user 42 eight hours later? Sketch a design.
- Stretch: Build a toy 3-service chain (gateway, middle service, leaf) with local JWT signing and verification. Implement token exchange between the gateway and the leaf. Verify that dropping the aud check on the leaf allows a confused-deputy attack; with the check, it is prevented.