The unified gateway pattern: one API in front of many backends
"A gateway is the place where you lie about your architecture. Clients see one URL; behind it are forty services, three clouds, and a half-broken Python script no one has touched since 2022."
Part V ended with agents orchestrating tools and other agents. That orchestration happened one layer above the request fabric. Part VI drops down to the fabric itself — the plumbing that carries every request from a client’s socket all the way into a model replica and back. The entry point to that plumbing, the first server that terminates TLS for an incoming request, is the API gateway.
The gateway is almost always the most loaded, most instrumented, most fought-over service in a production ML platform. It is where authentication happens, where rate limiting happens, where routing happens, where tenancy is enforced, where telemetry is emitted, and where backward compatibility is promised to the outside world even when the backend is being rewritten underneath. Getting the gateway right is the difference between a platform where a backend change is a one-line route update and a platform where every backend change breaks three teams.
Outline:
- Why API gateways exist.
- Gateways vs service meshes vs load balancers.
- Path, host, and header routing.
- The BFF (backend-for-frontend) pattern.
- Request aggregation and response stitching.
- Fan-out and scatter/gather.
- Cross-cutting concerns: auth, rate limiting, telemetry.
- The gateway as a contract boundary.
- Why ML platforms always have a gateway.
- Failure modes and anti-patterns.
- The mental model.
73.1 Why API gateways exist
A backend with one service does not need a gateway. A backend with forty services does. The reason is not any single feature — it is that the set of concerns shared by every service (auth, rate limiting, logging, TLS termination, CORS, quota enforcement, request ID propagation, user identity propagation) grows with the service count, and implementing all of them inside every service means forty copies of the same code drifting out of sync.
The gateway centralizes those concerns. One server — or one fleet of identical servers — handles them for every inbound request. Services behind the gateway can focus on their own logic and assume that by the time a request reaches them, it has been authenticated, its rate limit has been checked, its identity has been established, and its telemetry context has been initialized. The gateway is the common runway for every request.
The second reason is decoupling the public API from the internal topology. Clients do not want to know that /v1/chat/completions is served by one service and /v1/embeddings is served by another, each written in a different language and running on a different cluster. They want one URL, one auth story, one set of SDKs. The gateway lets the team rearrange the internal graph freely — split a service, merge two, move one to a new cluster — without changing anything clients see. That freedom is not a nice-to-have; it is the only way a platform with multiple teams stays shippable.
The third reason is enforcement. Rate limits, quotas, billing signals, denylist checks — if any of these are implemented only in the services, a single buggy service that forgets to check them leaks the entire platform. At the gateway, enforcement lives in exactly one place: get it right there, and every backend behind it is protected.
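The "common runway" idea can be made concrete as a middleware pipeline: each cross-cutting concern either rejects the request or enriches its context before the next stage runs. The following is a minimal sketch with invented names — not a real gateway framework, and the auth check is a placeholder where a real gateway would verify a token signature.

```python
# Minimal sketch of a gateway's cross-cutting pipeline. Every name here is
# illustrative; a production gateway runs these stages in C++/Go proxy code.
from dataclasses import dataclass, field

@dataclass
class Request:
    path: str
    headers: dict
    context: dict = field(default_factory=dict)

class Reject(Exception):
    def __init__(self, status, reason):
        self.status, self.reason = status, reason

def authenticate(req):
    token = req.headers.get("Authorization", "")
    if not token.startswith("Bearer "):
        raise Reject(401, "missing or malformed token")
    # placeholder: a real gateway validates the signature and expiration
    req.context["principal"] = token.removeprefix("Bearer ")

def check_rate_limit(req):
    # placeholder: a real limiter consults a shared counter store
    if req.context.get("principal") == "abusive-key":
        raise Reject(429, "rate limit exceeded")

def handle(req, middlewares, route):
    try:
        for mw in middlewares:
            mw(req)  # each stage rejects or enriches req.context
    except Reject as r:
        return {"status": r.status, "error": r.reason}
    return {"status": 200, "backend": route(req.path)}

route = lambda path: "llm-service" if path.startswith("/v1/chat") else "default"
print(handle(Request("/v1/chat/completions", {"Authorization": "Bearer k1"}),
             [authenticate, check_rate_limit], route))
# {'status': 200, 'backend': 'llm-service'}
```

The point of the shape: backends behind `handle` never see a request that failed a stage, which is exactly the guarantee the prose describes.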
73.2 Gateways vs service meshes vs load balancers
Three kinds of infrastructure look similar to a junior engineer and behave very differently.
A load balancer (L4 or L7 — think AWS NLB, HAProxy in its simple mode, a GCP forwarding rule) distributes TCP connections or HTTP requests across a pool of identical backends. It does one thing: pick a backend and forward. It does not know about auth, rate limits, or anything about the application.
An API gateway (Kong, Envoy as a gateway, AWS API Gateway, Apigee, Tyk, NGINX in gateway mode) sits in front of many different backends and applies per-route policy: auth, rate limit, routing, request/response transformation, aggregation. It is application-aware. One gateway fronts many services; a load balancer fronts many replicas of one service.
A service mesh (Istio, Linkerd, Consul Connect) runs a sidecar proxy next to every service instance and handles service-to-service traffic inside the cluster. It gives you mTLS, retries, circuit breakers, traffic shifting, and observability for east-west traffic — the calls services make to each other. The gateway is north-south (client-to-cluster); the mesh is east-west (service-to-service).
A mature platform uses all three in layers. The edge load balancer terminates connections and sheds obviously bad traffic. The gateway terminates TLS, authenticates, rate-limits, and routes to the right service. The service mesh handles the internal RPC hops with mTLS and retries. Each layer has exactly one job, and none of them duplicates another. The common mistake is to try to collapse layers — running Istio’s ingress gateway and hoping it also handles OAuth and per-tenant quotas — which usually ends with a Frankenstein configuration no one fully understands.
73.3 Path, host, and header routing
The gateway’s core job is to look at an incoming request and decide which backend should handle it. Three routing styles cover almost every case.
Path-based routing is the default. The URL path prefix selects the backend.
/v1/chat/completions → llm-service
/v1/embeddings → embedding-service
/v1/rerank → reranker-service
/v1/vectors/* → vector-db-service
/v2/documents → document-service-v2
This is the simplest and most common pattern. It maps cleanly to how engineers think about APIs and produces clean SDK boundaries. The downside: path prefixes leak internal structure. If every LLM-related path goes to llm-service, clients start to reason about llm-service as a thing, and later splitting it is harder.
Host-based routing uses the Host header. chat.example.com and embed.example.com can terminate on the same gateway fleet but route to different backends. This is common when different products share a gateway but have different SLAs or security postures. It also lets different tenants get their own hostname for branding and for cookie scoping.
Header-based routing is the escape hatch. Route on an arbitrary header: X-Model-Family: claude → one backend, X-Model-Family: llama → another. This is what you use for A/B tests, canaries, per-tenant routing, and feature flagging. It is also the mechanism for a “preview” or “staging” backend reachable from production by setting a header — dangerous but sometimes necessary.
Real configurations combine all three. Envoy and Kong both express this as a matcher DSL: “if host matches api.example.com and path starts with /v1/chat and header X-Beta is true, route to llm-service-beta, else route to llm-service-stable.”
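A matcher table of this kind is easy to sketch. The following toy version evaluates rules top-down, first match wins — the shape that Envoy's and Kong's route configs compile down to. The rule fields and service names are illustrative, not any real gateway's schema.

```python
# Toy matcher table: host + path prefix + header rules, first match wins.
# Rule order matters: the beta rule must precede the stable rule.
ROUTES = [
    {"host": "api.example.com", "path_prefix": "/v1/chat",
     "headers": {"X-Beta": "true"}, "backend": "llm-service-beta"},
    {"host": "api.example.com", "path_prefix": "/v1/chat",
     "backend": "llm-service-stable"},
    {"host": "api.example.com", "path_prefix": "/v1/embeddings",
     "backend": "embedding-service"},
]

def match_route(host, path, headers):
    for rule in ROUTES:
        if rule.get("host") and rule["host"] != host:
            continue
        if rule.get("path_prefix") and not path.startswith(rule["path_prefix"]):
            continue
        if any(headers.get(h) != v for h, v in rule.get("headers", {}).items()):
            continue
        return rule["backend"]
    return None  # no route matched: the gateway returns 404

print(match_route("api.example.com", "/v1/chat/completions", {"X-Beta": "true"}))
# llm-service-beta
```

Note how the header rule acts as a refinement of the path rule — remove the `X-Beta` header and the same request falls through to the stable backend.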
73.4 The BFF (backend-for-frontend) pattern
A pure gateway routes one incoming request to one backend. A backend-for-frontend does more: it composes calls to several backends on behalf of a specific frontend (a web app, a mobile app, a CLI) and returns a single response shaped for that frontend’s needs.
The problem the BFF solves is that different frontends want different shapes of data. A mobile app wants the minimum necessary fields to render a chat screen: last message, typing indicator, user name. A web dashboard wants the full conversation history, the rate limit status, the billing tier, and the model capabilities. A CLI wants a streaming token feed with no markup. If every frontend calls the same gateway endpoint, either the endpoint returns the union of everything (bloating every response) or each frontend makes three or four round trips and stitches on the client.
The BFF fixes this by putting a per-frontend aggregator in front of the core gateway. The mobile BFF exposes /mobile/chat/:id and internally calls GET /v1/messages?limit=20, GET /v1/presence, and GET /v1/user/profile, then merges the results into a mobile-shaped response.
The BFF is where you put frontend-specific concerns that do not belong in core services: rendering fallback content for a slow backend, inlining cache headers for a specific CDN, translating between JSON and GraphQL for a team that standardized on one. A common mistake is letting BFFs grow their own business logic; they should stay thin composition layers. The moment a BFF starts applying rate limits or making authorization decisions, it has become another core service and should be named one.
73.5 Request aggregation and response stitching
Aggregation is the BFF’s signature move: one inbound request becomes several outbound requests, and the results are stitched together. A concrete example from an ML platform:
GET /v1/assistant/status → {
  models: callHTTP(/v1/models/list),
  quota:  callHTTP(/v1/billing/quota?tenant=T),
  usage:  callHTTP(/v1/metering/usage?tenant=T&window=1d),
  health: callHTTP(/v1/health/summary)
}
The BFF launches the four backend calls concurrently (not sequentially — this is the whole point), waits for them with a deadline, and composes the response. The total latency is max(call_1, call_2, call_3, call_4) plus a small overhead, not the sum.
The stitching logic needs to handle partial failure. If the quota service is down, you do not want the whole response to fail; you want to return the pieces that succeeded and mark quota as null with an error field. Decide up front which fields are required and which are best-effort, and make that decision explicit. A common anti-pattern is “500 if any subcall fails,” which turns a tiny dependency into a hard dependency for every request.
Set tight deadlines on the subcalls. If the outer request has a 2 s deadline and you launch four subcalls in parallel, each subcall should get a ~1.8 s deadline, not the full 2 s. That margin accounts for stitching latency and makes sure the outer request honors its SLA. Propagate deadlines via a standard header (X-Request-Deadline or gRPC’s built-in deadline) so the backends can shed work that will never be used.
The aggregator also has to think about fairness: one slow backend should not starve the others. Use concurrent futures with independent cancellation, not a blocking loop. Most gateway frameworks handle this; if you are writing it by hand (do not), asyncio.gather with return_exceptions=True is the canonical Python version.
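Putting the pieces of this section together — parallel launch, a subcall deadline tighter than the outer one, and best-effort fields that degrade instead of failing the request — a minimal asyncio sketch looks like this. The backend calls are stubs; in production each would be an HTTP call carrying the propagated deadline, and the stub names are invented for illustration.

```python
# Sketch of the aggregation loop: four subcalls in parallel under a shared
# deadline, with per-field degradation on failure. Backends are simulated.
import asyncio

async def call_backend(name, delay, fail=False):
    await asyncio.sleep(delay)
    if fail:
        raise RuntimeError(f"{name} unavailable")
    return {"service": name, "ok": True}

async def assistant_status(outer_deadline_s=2.0):
    subcall_deadline = outer_deadline_s * 0.9  # leave margin for stitching
    tasks = {
        "models": call_backend("models", 0.01),
        "quota": call_backend("billing", 0.01, fail=True),  # simulated outage
        "usage": call_backend("metering", 0.01),
        "health": call_backend("health", 0.01),
    }
    # return_exceptions=True: one failed subcall does not cancel the rest
    results = await asyncio.wait_for(
        asyncio.gather(*tasks.values(), return_exceptions=True),
        timeout=subcall_deadline,
    )
    response = {}
    for field_name, result in zip(tasks, results):
        if isinstance(result, Exception):
            response[field_name] = None  # best-effort field degrades to null
            response.setdefault("errors", {})[field_name] = str(result)
        else:
            response[field_name] = result
    return response

print(asyncio.run(assistant_status()))
```

Total latency here is governed by the slowest subcall, not the sum, and the quota outage shows up as a `null` field with an error annotation rather than a 500.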
73.6 Fan-out and scatter/gather
Aggregation calls different backends. Fan-out calls the same kind of backend many times and merges the results. This is scatter/gather, and it shows up everywhere in ML platforms.
A vector search request fans out to every shard of the vector index. A federated search over tenant shards fans out to all the tenant’s shards. A multi-model evaluator fans out to every candidate model with the same prompt. A disaggregated prefill/decode setup (Chapter 36) may fan out one logical request to a prefill replica and a decode replica through a proxy gateway.
Scatter/gather has its own pathologies. The tail latency is governed by the slowest shard, which is brutal math: if you fan out to 100 shards and each has a 1% chance of being slow, there is a 63% chance that some shard is slow on every request. The standard defense is hedged requests (send a second copy of the query to a different replica after a timeout, take whichever returns first) and backup requests (send to two replicas from the start, take the fastest). These are cheap if the replicas are cheap; on GPU-backed services they get expensive fast.
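A hedged request is mechanically simple: wait a short hedge delay for the primary, then race a backup against it and cancel the loser. The following asyncio sketch simulates replica latencies; the delays and replica names are illustrative.

```python
# Sketch of a hedged request: fire the primary, and if it has not returned
# within hedge_delay_s, fire a backup and take whichever finishes first.
import asyncio

async def query_replica(replica, latency_s):
    await asyncio.sleep(latency_s)  # stands in for a real backend call
    return f"result-from-{replica}"

async def hedged(primary_latency, backup_latency, hedge_delay_s=0.05):
    primary = asyncio.create_task(query_replica("primary", primary_latency))
    try:
        # shield() keeps the timeout from cancelling the primary itself
        return await asyncio.wait_for(asyncio.shield(primary), hedge_delay_s)
    except asyncio.TimeoutError:
        pass  # primary is slow: hedge to a second replica
    backup = asyncio.create_task(query_replica("backup", backup_latency))
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # abandon the loser's work
    return done.pop().result()

# Fast primary: no hedge fires. Slow primary: the backup wins the race.
print(asyncio.run(hedged(0.01, 0.01)))  # result-from-primary
print(asyncio.run(hedged(0.5, 0.01)))   # result-from-backup
```

The hedge delay is the knob: set it near the primary's p95 and hedges fire on roughly 5% of requests, which bounds the extra load — the key cost consideration on GPU-backed replicas.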
The other pathology is partial result reasoning. If 99 of 100 shards return and one times out, do you return the 99 and mark the response degraded, or do you fail the whole request? For vector search you usually return partial results (recall drops a bit, nobody dies). For billing aggregation you fail hard because incomplete answers are worse than none. That decision lives in the gateway.
Fan-out is also where result merging gets interesting. Top-k across shards requires each shard to return its local top-k plus some margin, then the gateway merges and re-ranks. If the merger is naive, a shard with very different score distributions can dominate and hurt overall recall. Merging strategies are a Chapter 46 topic (hybrid search); the point here is that the merge itself lives at the gateway layer.
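The merge step itself is a few lines when scores are comparable across shards. A minimal sketch of the naive version — which, as noted, breaks down when per-shard score distributions differ and need normalization first:

```python
# Sketch of top-k merging after a fan-out: each shard returns its local
# top-k as (score, doc_id) hits; the gateway keeps the global top-k.
# Assumes scores are directly comparable across shards.
import heapq

def merge_topk(shard_results, k):
    return heapq.nlargest(
        k, (hit for shard in shard_results for hit in shard))

shards = [
    [(0.95, "a1"), (0.60, "a2")],
    [(0.80, "b1"), (0.75, "b2")],
    [(0.90, "c1"), (0.10, "c2")],
]
print(merge_topk(shards, 3))
# [(0.95, 'a1'), (0.9, 'c1'), (0.8, 'b1')]
```

Note what the example makes visible: shard b's second-best hit (0.75) loses to shard c's best (0.90), which is why each shard must return more than `k / num_shards` candidates.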
73.7 Cross-cutting concerns: auth, rate limiting, telemetry
The gateway is where cross-cutting policy lives. The three big ones:
Authentication (Chapter 74) — verify who the caller is. Extract the token from an Authorization header, validate the signature and expiration, extract the principal, attach it to the request context, pass it downstream. Reject with 401 if anything is wrong. The gateway does this once so every backend can assume the caller is authenticated.
Rate limiting and quotas (Chapter 76) — enforce request-rate and cost-based limits. Different tenants get different limits. Different endpoints get different limits. Different models get different limits. The gateway applies them before forwarding so a rogue client cannot run up a bill or pin a GPU by spamming. A good gateway supports multiple concurrent limits (“max 10 rps per key AND max 1M tokens per day per tenant”) and returns 429 with Retry-After when any limit is hit.
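The "multiple concurrent limits" behavior can be sketched with token buckets: a request is admitted only if every applicable bucket has budget, and the 429 reports the longest wait among the violated limits as Retry-After. This in-memory version stands in for the shared store (typically Redis) a real gateway fleet would use, and it simplifies one thing a production limiter must not: it charges all buckets even when one rejects, where a real implementation charges atomically or refunds on rejection.

```python
# Sketch of layered rate limits via token buckets. In-memory only; a gateway
# fleet needs a shared store and atomic charge-or-refund semantics.
import time

class TokenBucket:
    def __init__(self, rate_per_s, burst):
        self.rate, self.capacity = rate_per_s, burst
        self.tokens, self.last = float(burst), time.monotonic()

    def try_take(self, cost=1.0):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return None  # admitted
        return (cost - self.tokens) / self.rate  # seconds until budget exists

def admit(limits, cost=1.0):
    # simplification: every bucket is charged even if another rejects
    waits = [w for w in (b.try_take(cost) for b in limits) if w is not None]
    if waits:
        return {"status": 429, "retry_after_s": max(waits)}
    return {"status": 200}

per_key = TokenBucket(rate_per_s=10, burst=10)       # e.g. 10 rps per API key
per_tenant = TokenBucket(rate_per_s=100, burst=200)  # tenant-wide budget
for _ in range(12):
    result = admit([per_key, per_tenant])
print(result)  # the requests past the per-key burst get a 429
```

Cost-based limits ("1M tokens per day per tenant") fit the same shape: pass the token count as `cost` against a bucket whose rate is the daily budget divided by 86,400 seconds.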
Telemetry and tracing — initialize a request ID if one is not present, start a distributed trace span, emit access logs, record metrics (rate, errors, duration). Everything downstream inherits the trace context. A production gateway emits at least: request count by route and status, latency histograms by route, bytes in/out, active connection counts, and upstream health. See Chapter 95 for the tracing details.
These three concerns are policy, not logic. They should be declarative — a YAML or CRD that names the route, the auth requirements, and the rate limits — not code scattered across handlers. The gateway executes the policy. The day that policy gets complicated enough to require a Turing-complete DSL is the day you have outgrown your gateway and need to think about splitting it or moving logic into dedicated policy services.
73.8 The gateway as a contract boundary
Public API versioning lives at the gateway. When /v1/chat/completions becomes /v2/chat/completions, the gateway routes /v1 to a compatibility shim that calls the new backend with old-style parameters and translates the response. Clients on /v1 see no change; the team ships the new backend without a migration gun pointed at every user.
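A compatibility shim is just a pair of pure translation functions: v1 request shape in, v2 backend parameters out, and the reverse for the response. The field names below are invented for illustration — the pattern, not any real API, is the point.

```python
# Sketch of a v1→v2 compatibility shim living at the gateway. All field
# names are hypothetical; the structure is what matters.
def v1_to_v2_request(v1_body):
    # pretend v2 renamed max_tokens and nested the sampling parameters
    return {
        "messages": v1_body["messages"],
        "sampling": {
            "max_output_tokens": v1_body.get("max_tokens", 1024),
            "temperature": v1_body.get("temperature", 1.0),
        },
    }

def v2_to_v1_response(v2_resp):
    # v1 clients expect a flat `text` field, so flatten the v2 shape
    return {"text": v2_resp["output"]["content"],
            "usage": v2_resp["usage"]}

v1_req = {"messages": [{"role": "user", "content": "hi"}], "max_tokens": 64}
v2_req = v1_to_v2_request(v1_req)
v2_resp = {"output": {"content": "hello"}, "usage": {"output_tokens": 2}}
print(v2_to_v1_response(v2_resp))
# {'text': 'hello', 'usage': {'output_tokens': 2}}
```

Because both directions are pure functions, they are trivially contract-testable with golden input/output pairs — which is exactly what the deploy-time contract tests described below should exercise.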
Contract tests live here too. The gateway’s OpenAPI schema (or gRPC proto, or GraphQL schema) is the authoritative contract. Backend services can change internal shapes freely as long as the gateway’s translation layer keeps the external shape stable. This is the BFF idea generalized: the gateway owns the public contract, the backends own the implementation, and the translation is in one place where it can be tested.
A common mistake is to let backends expose their native APIs directly through path-only routing. This couples the public API to internal shape, and every backend refactor breaks clients. The gateway should at minimum validate the request against the schema, reject anything that does not match with a 400, and redact or transform fields as the contract requires. If the gateway is just a dumb proxy, you have no contract boundary.
The contract boundary is also where deprecation happens. Add a Deprecation header when a field is going away. Log structured events for every call to a deprecated route so you can go from “sunset date” to an email list of tenants still using it. Without this visibility, deprecation never happens; with it, deprecation is a spreadsheet.
73.9 Why ML platforms always have a gateway
Every non-trivial LLM platform converges on the same topology: edge → API gateway → auth → rate limit → routing → model services. The reasons stack up until a gateway is the only sane answer.
Heterogeneous backends: /chat goes to a vLLM pool, /embeddings goes to a TEI pool, /rerank goes to a separate reranker service, /vectors goes to a vector DB, /images goes to a different runtime entirely. Each of these runs different software, different hardware, different scaling policies. Routing is a first-class problem. A gateway solves it.
Multi-tenant isolation: tenants get different rate limits, different model access, different quota pools. The enforcement has to be at the edge, before any expensive backend work. A gateway is the only place that can do this without duplicating the logic in every model service.
API stability: SDKs cannot break. The OpenAI-compatible SDK, the Anthropic-compatible SDK, the internal SDK — all of them expect specific shapes that outlast any backend. A gateway with a translation layer keeps the contract stable through backend churn.
Cost protection: every request to a GPU service costs real money. Dropping a bad request at the gateway costs a few microseconds of CPU. Dropping it deep in the backend stack costs 30 ms of GPU time at minimum and potentially a full prefill. Shift enforcement left.
Observability: one place to measure request rate, latency, and error rate, cleanly broken down by route and tenant. Without the gateway, you stitch metrics together from N services, which drift.
Connection from Chapter 50: the AI gateway from Chapter 50 (semantic caching, provider routing, content filters, LLM-specific cost accounting) is a specialization of this pattern. It is the same architectural position with ML-specific policies bolted on. A platform can have a general-purpose gateway at the edge and an AI-gateway layer inside for LLM-specific concerns, or it can collapse them into one fleet. Both are fine; the layering is what matters.
73.10 Failure modes and anti-patterns
A list of the ways gateways go wrong in practice.
The gateway becomes a monolith. Every new feature lands in the gateway because “it is the logical place.” Eventually the gateway has 40,000 lines of business logic, and deploying it is a weekly incident. Fix: push logic into dedicated policy services (auth, rate limit) that the gateway calls, keep the gateway itself declarative.
Single point of failure. One gateway cluster serves everything. When it restarts badly, the whole platform is down. Fix: run the gateway as a stateless fleet with N replicas behind a boring load balancer, deploy changes through rolling deployments, test rollbacks weekly. Never cache state in the gateway process unless it is ephemeral and reconstructible.
Chatty aggregation. A BFF fans out to 15 backends on every request, each call with a p99 of 100 ms. The BFF’s p99 is 400 ms even though each backend is fast. Fix: aggregate less, cache more, denormalize the data a BFF needs into the response of a single upstream call.
Deadlines that do not propagate. A client gives up after 1 s. The gateway does not propagate the deadline, and backends keep working for 10 s producing results nobody is reading. Fix: propagate deadlines end-to-end via a standard header; honor them in every handler.
Retries without budget. The gateway retries on 5xx. Every backend also retries on 5xx. One failing leaf node causes 3× load three hops up, and the whole platform melts. Fix: retry budgets (Chapter 77), retry only at the outermost layer, mark retried requests with a header so downstream skips their own retry.
Routing drift. Routes are configured in three places — gateway config, service mesh, ingress. Two disagree. Requests hit the wrong backend for a week before anyone notices. Fix: one authoritative source of routing (usually the gateway), everything else generated from it.
Silent transformation bugs. The gateway’s request/response translation has a bug that drops one field. Nobody notices for a month. Fix: contract tests that run on every gateway deploy with golden inputs and expected outputs.
73.11 The mental model
Eight points to take into Chapter 74:
- The gateway is a runway for every request. Auth, rate limiting, routing, logging, TLS — all centralized so backends do not duplicate them.
- Gateway, mesh, load balancer are three different things. North-south vs east-west vs dumb pipe. Mature platforms use all three, layered.
- Routing is path-based, host-based, or header-based. Real configs combine all three with a matcher DSL.
- BFFs are per-frontend aggregators. They compose calls and stitch responses shaped for one specific client.
- Aggregation must be parallel with deadlines and tolerant of partial failure. Fan-out (scatter/gather) is a BFF with identical backends and merge logic.
- The gateway is the public API contract boundary. Version here, deprecate here, validate here.
- ML platforms always have a gateway because backends are heterogeneous, cost-sensitive, and multi-tenant. Chapter 50’s AI gateway is the LLM-specific specialization.
- Failure modes are predictable: monolith creep, no deadline propagation, uncontrolled retries, single point of failure, routing drift.
In Chapter 74 the first cross-cutting concern gets its own deep dive: authentication versus authorization, the distinction interviewers test on.
Read it yourself
- Sam Newman, Building Microservices (2nd ed.), chapters on API gateways and the BFF pattern. The canonical modern treatment.
- Phil Calçado, “The Back-end for Front-end Pattern (BFF).” The original write-up from the SoundCloud team.
- Envoy Proxy documentation, especially the HTTP routing and filter chain sections.
- Kong Gateway documentation on plugins and the route/service model.
- The Kubernetes Gateway API specification (gateway-api.sigs.k8s.io).
- Istio’s documentation on ingress gateways and the distinction from sidecar proxies.
Practice
- Sketch a gateway config (YAML is fine) that routes /v1/chat/completions to llm-service, /v1/embeddings to embed-service, and both to a beta backend when the header X-Beta: true is present.
- Write the pseudocode for a BFF endpoint that calls four backend services concurrently with a 1.5 s deadline, tolerates failure of any single one, and returns a merged response.
- Explain the difference between an API gateway and a service mesh. Give one concern each handles uniquely.
- Why does the gateway have to propagate deadlines? Construct a scenario where missing deadline propagation causes a cascading failure.
- A client rate-limits at 10 rps but the gateway enforces 5 rps. Whose limit wins and why? What should the gateway return?
- For a scatter/gather over 20 shards with per-shard p99 of 50 ms, estimate the aggregate p99 before and after hedged requests.
- Stretch: Stand up a local Envoy or Kong instance with two backends. Implement path-based and header-based routing. Add a rate limit. Confirm the 429 behavior with a load generator like hey or k6.