Part III · Inference Internals & Production Serving
Chapter 50 · ~14 min read

AI gateways: Envoy AI Gateway and the OpenAI-compatible front door

"You don't expose vLLM directly to users. You put a gateway in front of it. Always"

In Chapter 49 we covered TEI; in Chapters 45-46 we covered vLLM and KServe. Each is a runtime that serves a single model (or a small set). Production deployments run many models — different sizes, different families, different specializations, different fine-tunes — and clients need a single API to talk to all of them.

The component that provides this single API is the AI gateway. By the end of this chapter you’ll know what AI gateways do, what the major options are, and how to design a routing layer for a production LLM platform.

Outline:

  1. The gateway pattern.
  2. The OpenAI-compatible API as the lingua franca.
  3. What an AI gateway adds on top of a general-purpose API gateway.
  4. Envoy AI Gateway.
  5. LiteLLM as a gateway.
  6. Other options: Portkey, Kong AI Gateway, vLLM router.
  7. Routing strategies.
  8. Authentication, rate limiting, and observability at the gateway.
  9. The buffer-limit gotcha.
  10. The decision matrix.

50.1 The gateway pattern

The basic motivation for an AI gateway is the same as for any API gateway: decouple the client interface from the backend services.

Without a gateway:

  • Each client has to know about every model’s specific endpoint.
  • Authentication has to be implemented per backend.
  • Rate limiting has to be enforced per backend.
  • Routing logic has to live in client code.
  • Observability requires instrumenting every backend separately.

With a gateway:

  • Clients see a single API endpoint and a list of model names.
  • Authentication is centralized.
  • Rate limiting is centralized.
  • Routing is centralized.
  • Observability is centralized.

For LLM serving, the gateway pattern is essential because:

  • Many models are typically served (different sizes, different fine-tunes, different vendors).
  • Multiple backends may serve the same model (for load balancing or failover).
  • Model versioning requires routing logic (route v1 to old fleet, v2 to new fleet, canary at percentage).
  • Rate limiting is per-tenant and needs central enforcement.
  • Cost tracking by token requires central observation.

The AI gateway is the component that ties the platform together at the API layer.

The AI gateway provides a single OpenAI-compatible endpoint and routes by model name to the appropriate backend, centralizing auth, rate limiting, and observability.
The AI gateway is the single entry point: it validates auth, enforces per-tenant token rate limits, routes by model name, and fans out to independent backend clusters.

50.2 The OpenAI-compatible API as the lingua franca

The de facto standard API for LLM serving is OpenAI’s API. Specifically:

  • POST /v1/chat/completions — chat-style generation.
  • POST /v1/completions — legacy completion-style generation.
  • POST /v1/embeddings — text embeddings.
  • GET /v1/models — list available models.

OpenAI’s API documentation defines the request and response formats in detail. Every major LLM serving framework (vLLM, SGLang, TGI, TensorRT-LLM, llama.cpp, TEI for embeddings) exposes an OpenAI-compatible API. This means a client written against OpenAI’s API works against any of these backends with no code changes — just point it at a different URL.

This standardization is enormously valuable. It means:

  • Clients can use OpenAI’s official Python/JS libraries (openai-python, openai-node) to talk to any LLM backend.
  • Switching backends is a configuration change, not a code change.
  • The gateway just has to forward OpenAI-compatible requests to the right backend.
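To make the "configuration change, not a code change" point concrete, here is a minimal sketch of an OpenAI-compatible request body and a backend map. The URLs and model name are hypothetical; the shape of the body is the OpenAI chat-completions format.

```python
# Sketch: the same OpenAI-style request body works against any
# compatible backend; only the target URL differs.

def chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

# Switching backends is configuration, not code (URLs are illustrative):
BACKENDS = {
    "openai":  "https://api.openai.com/v1",
    "gateway": "https://llm-gateway.internal/v1",
}

body = chat_request("llama-3-70b", "Hello")
```

The `model` field in this body is the only routing signal a gateway needs, which is why the sections below keep coming back to it.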

There are some non-OpenAI APIs (Anthropic’s Messages API, Google’s Gemini API, etc.) but the OpenAI standard dominates the open ecosystem. Most AI gateways are designed primarily around it, with optional adapters for other formats.

The gateway’s job: accept OpenAI-compatible requests, route to the right backend, return OpenAI-compatible responses.

50.3 What an AI gateway adds

An AI gateway is a specialization of a general-purpose API gateway with LLM-specific features. The features that distinguish “AI gateway” from “general gateway”:

(1) Model-name-based routing. The OpenAI request specifies a model field (e.g., "model": "gpt-4o" or "model": "llama-3-70b"). The gateway routes based on this field — different models go to different backends.

(2) Streaming support. OpenAI APIs support server-sent events (SSE) for streaming token-by-token responses. The gateway must forward streams correctly without buffering.

(3) Token counting. The gateway counts input and output tokens to enforce rate limits and feed billing.

(4) Long-response handling. LLM responses can be very long. The gateway needs large buffer limits (or streaming) to handle them.

(5) Provider abstraction. Some gateways translate between API formats (OpenAI → Anthropic, OpenAI → Cohere, etc.) so clients can use one API for many providers.

(6) Prompt caching awareness. Some gateways are aware of which routes support prefix caching and route accordingly.

(7) Per-token rate limiting. Standard rate limiters count requests; AI gateways count tokens (because token cost is the actual constraint).

These are the additions. A general-purpose gateway (Nginx, plain Envoy, AWS API Gateway) can be hacked into doing these things, but a purpose-built AI gateway handles them out of the box.

50.4 Envoy AI Gateway

Envoy AI Gateway is a Kubernetes-native AI gateway built on top of Envoy and the Envoy Gateway project. It’s part of the broader Envoy ecosystem and is open source under the CNCF, the home of the Envoy proxy project.

Envoy AI Gateway uses the Kubernetes Gateway API (the standard CRDs for routing in K8s) and adds AI-specific extensions. The components:

  • Envoy Gateway — the underlying L7 proxy.
  • AIGatewayRoute — a CRD that defines AI-specific routing rules.
  • AIServiceBackend — a CRD that defines a backend (a vLLM instance, a TEI instance, an external API).
  • BackendSecurityPolicy — for authentication and credentials.

A minimal AIGatewayRoute:

apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: llm-routes
spec:
  schema:
    name: OpenAI
  rules:
    - matches:
        - headers:
            - name: x-ai-eg-model
              value: llama-3-70b
      backendRefs:
        - name: vllm-llama3-70b
          weight: 100
    - matches:
        - headers:
            - name: x-ai-eg-model
              value: qwen-2.5-72b
      backendRefs:
        - name: vllm-qwen-72b
          weight: 100

This routes requests by model name to different backends.
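Internally, the matching works because the gateway lifts the `model` field out of the request body into the `x-ai-eg-model` header before rule evaluation. A rough sketch of that step (the rule table mirrors the route above; the error handling is illustrative):

```python
import json

# Rough sketch of the gateway's matching step: the body's model field
# is copied into a routing header, then matched against the rules
# declared in the AIGatewayRoute.
ROUTE_RULES = {
    "llama-3-70b": "vllm-llama3-70b",
    "qwen-2.5-72b": "vllm-qwen-72b",
}

def route(raw_body: bytes) -> str:
    body = json.loads(raw_body)
    model = body.get("model", "")
    headers = {"x-ai-eg-model": model}   # header the rules match on
    backend = ROUTE_RULES.get(headers["x-ai-eg-model"])
    if backend is None:
        raise LookupError(f"404: model {model!r} not found")
    return backend
```

The practical consequence: clients never set `x-ai-eg-model` themselves; they just send a normal OpenAI request and the gateway derives the routing key.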

Strengths:

  • Kubernetes-native. Integrates with Gateway API, ArgoCD, and the rest of the K8s ecosystem.
  • Production-grade. Envoy is battle-tested at scale (used by Istio, Lyft, Airbnb, etc.).
  • Extensible. Custom Wasm filters for advanced logic.
  • Multi-format support. OpenAI, Anthropic, AWS Bedrock formats.
  • Open source under CNCF. No vendor lock-in.
  • Token-based rate limiting.
  • Tracing and metrics via Envoy’s existing observability stack.

Weaknesses:

  • Newer than vLLM/KServe. Less battle-tested in production.
  • Complex configuration. Multiple CRDs and concepts.
  • Buffer limits need careful tuning for long LLM responses (more on this in §50.9).

Envoy AI Gateway is the modern default for K8s-based AI gateway deployments as of late 2025. Most new platforms standardize on it.

50.5 LiteLLM as a gateway

LiteLLM is a Python-based LLM proxy that started as a client-side library for unified LLM access and has grown into a full-featured server-side gateway.

The LiteLLM gateway:

  • Accepts OpenAI-compatible requests.
  • Translates to many backend providers (OpenAI, Anthropic, Cohere, Together, Anyscale, vLLM, custom).
  • Handles rate limiting, retries, fallbacks, cost tracking.
  • Provides a web UI for management.

Strengths:

  • Easy to start. Run as a single Python process.
  • Wide provider support. 100+ provider integrations.
  • Good for hybrid clouds. Use OpenAI for some queries, local LLMs for others.
  • Active community.

Weaknesses:

  • Python performance. Single-process Python isn’t as fast as Envoy.
  • Operationally heavier than Envoy for K8s deployments.

LiteLLM is the right choice when:

  • You want to mix many LLM providers (OpenAI + Anthropic + self-hosted).
  • You want a Python-first gateway with a web UI.
  • You’re not running on K8s.

For pure K8s self-hosted deployments, Envoy AI Gateway is more common.

50.6 Other options

A few more gateways worth knowing:

Portkey AI Gateway. Open-source AI gateway with strong observability and routing features. Used by some teams as an alternative to LiteLLM.

Kong AI Gateway. Kong’s commercial AI gateway plugin. Targets enterprise with Kong’s feature set.

vLLM Router. A simpler alternative — vLLM has a built-in router that can route between multiple vLLM instances. Less feature-rich than Envoy but simpler to deploy if you only have vLLM.

Cloudflare Workers AI Gateway. Cloudflare’s hosted AI gateway. Useful if you’re already on Cloudflare.

Custom gateways. Many teams build their own (Python FastAPI, Go, Rust). The features they need are not exotic; building one takes a few weeks.

The space is competitive. Pick based on your existing infra and feature needs.

50.7 Routing strategies

The routing logic at the gateway is where most of the AI-specific complexity lives. The strategies:

(1) Model-name-based routing. The simplest. The request’s model field maps to a backend. gpt-4o → OpenAI, llama-3-70b → vLLM cluster A, qwen-2.5-72b → vLLM cluster B.

(2) Header-based routing. Use a custom header (e.g., X-Tenant: acme-corp) to route to tenant-specific backends or apply tenant-specific limits.

(3) Path-based routing. Different URL paths route to different backends. /v1/chat/completions → LLM backend, /v1/embeddings → TEI backend.

(4) Weight-based routing (canary). Route 95% of traffic to v1, 5% to v2. Used for gradual rollouts.

(5) Sticky routing. Pin a session to a specific backend (for prefix cache locality, for example).

(6) Failover. If primary backend is down, route to a fallback. LiteLLM has rich fallback support.

(7) Cost-based routing. Route to the cheapest available backend that satisfies the request (e.g., use a cheap small model when the prompt is simple, escalate to a big model only when needed). This is a more advanced pattern.

(8) Latency-based routing. Route to the fastest backend. Useful when the same model is served by multiple providers with different latencies.

For most production deployments, routing is model-name-based with weight-based canary support plus per-tenant rate limiting. The fancier strategies (cost-based, latency-based) are for sophisticated deployments.
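The common combination — model-name routing plus a weighted canary — can be sketched in a few lines. Backend names and weights below are illustrative, not from any real configuration:

```python
import random

# Sketch: model-name routing with a weighted canary split.
CANARY = {"llama-3-70b": [("vllm-fleet-v1", 95), ("vllm-fleet-v2", 5)]}
STATIC = {"bge-large-en": "tei-backend"}

def pick_backend(model: str, rng: random.Random = random) -> str:
    if model in STATIC:
        return STATIC[model]           # fixed mapping, no split
    if model in CANARY:
        targets, weights = zip(*CANARY[model])
        return rng.choices(targets, weights=weights, k=1)[0]
    raise LookupError("404: model not found")
```

Promoting the canary is then a weight change (95/5 → 0/100), not a code change, which is exactly why gateways put this logic in configuration.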

graph TD
  R[Incoming request] --> M{model field?}
  M -->|llama-3-70b| W{canary split}
  M -->|bge-large-en| TEI[TEI backend]
  M -->|unknown| E[404 model not found]
  W -->|95%| V1[vLLM fleet v1]
  W -->|5%| V2[vLLM fleet v2 canary]

Model-name routing sends embedding requests to TEI and chat requests to the appropriate vLLM fleet; the canary split keeps 5% of traffic on the new version until it is validated.

50.8 Authentication, rate limiting, observability

The cross-cutting concerns the gateway handles:

Authentication

The gateway terminates authentication. Common patterns:

  • API key in header (Authorization: Bearer sk-...). Standard for OpenAI-compatible APIs.
  • JWT (more sophisticated, with claims about the user/tenant).
  • OAuth2 (for user-facing applications).
  • mTLS (for service-to-service).

The gateway validates the credentials, extracts the user/tenant identity, and forwards to the backend with optional headers (X-User-Id, X-Tenant, etc.).
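A minimal sketch of that auth step, assuming the simplest pattern (static API keys); the key table and forwarded header names are illustrative:

```python
# Sketch of the gateway's auth step: validate the API key, resolve the
# tenant, and forward identity headers so backends never see raw
# credentials. Keys and header names are hypothetical.
API_KEYS = {"sk-test-abc123": {"user": "u-42", "tenant": "acme-corp"}}

def authenticate(headers: dict) -> dict:
    auth = headers.get("Authorization", "")
    if not auth.startswith("Bearer "):
        raise PermissionError("401: missing bearer token")
    identity = API_KEYS.get(auth.removeprefix("Bearer "))
    if identity is None:
        raise PermissionError("401: invalid API key")
    return {"X-User-Id": identity["user"], "X-Tenant": identity["tenant"]}
```

In production the key table lives in a secrets store or identity provider rather than in-process, but the shape — credential in, identity headers out — is the same.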

Rate limiting

LLM rate limiting is token-based, not request-based. A single request can cost 100 tokens or 100,000 tokens, so tokens, not requests, are the unit that tracks real cost.

The gateway tracks:

  • Tokens per minute (TPM) per tenant or per API key.
  • Requests per minute (RPM) as a secondary cap.
  • Concurrent requests to prevent denial-of-service.

When a tenant exceeds their limit, the gateway returns 429 Too Many Requests. OpenAI uses the same pattern; Envoy AI Gateway implements it natively.
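The TPM enforcement above is typically a token bucket that refills at the tenant's rate. A minimal in-process sketch (real gateways keep this state in Redis or in the proxy so it survives restarts and is shared across replicas):

```python
import time

# Sketch: per-tenant tokens-per-minute limiting as a token bucket.
class TpmLimiter:
    def __init__(self, tpm: int):
        self.tpm = tpm                  # tokens allowed per minute
        self.allowance = float(tpm)     # current bucket level
        self.last = time.monotonic()

    def allow(self, tokens: int) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the TPM limit.
        self.allowance = min(
            self.tpm, self.allowance + (now - self.last) * self.tpm / 60.0
        )
        self.last = now
        if tokens > self.allowance:
            return False                # caller should return 429
        self.allowance -= tokens
        return True
```

One subtlety the sketch glosses over: output tokens are unknown until generation finishes, so real gateways either estimate them up front or debit the bucket after the response completes.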

Observability

The gateway sees every request, so it’s the right place to capture observability:

  • Request count per model, per tenant.
  • Token count (input and output) per model, per tenant.
  • Latency distribution per model.
  • Error rate per model and per error type.
  • Cost computed from token counts and model pricing.

The gateway exports these metrics to Prometheus, traces to Jaeger/Tempo, and logs to Loki.

50.9 The buffer-limit gotcha

One specific operational gotcha worth knowing: Envoy’s default buffer limit is too small for LLM responses.

Envoy buffers HTTP request and response bodies up to a configurable limit (per_connection_buffer_limit_bytes and related filter settings) before forwarding. The default is 1 MB. For non-LLM APIs, this is plenty.

For LLM responses, especially long ones (reasoning model outputs, code generation, document summarization), responses can easily exceed 1 MB. When they do, Envoy truncates the response, the client gets a partial response, and the model’s actual output is lost.

This has caused multiple production incidents in real Envoy AI Gateway deployments. The fix is to increase the buffer limit for AI gateway routes:

spec:
  responseBuffer:
    maxBytes: 50Mi

Setting the buffer to 50 MB (or higher) covers virtually all LLM responses. The cost is some additional memory in Envoy (buffers are per-request, so the worst case is max_buffer × concurrent_requests).
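The worst-case memory math is worth making explicit. A back-of-envelope sketch (the concurrency figure is illustrative):

```python
def worst_case_buffer_memory(buffer_mib: int, concurrent: int) -> float:
    """GiB of Envoy memory if every in-flight request fills its buffer."""
    return buffer_mib * concurrent / 1024

# 50 MiB buffers at 200 concurrent requests:
headroom_gib = worst_case_buffer_memory(50, 200)   # 9.765625 GiB
```

In practice almost no responses fill the buffer, so actual usage is far lower, but the proxy's memory limits should be sized for this worst case.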

For streaming responses (SSE), the buffer issue is different — Envoy needs to forward chunks immediately rather than buffering the whole response. The streaming path has its own configuration (upstream_buffering and similar). Get this right or your streaming responses will be choppy.

The buffer-limit gotcha is one of the top operational issues with Envoy-based AI gateways. Always set explicit buffer limits.

With the default 1 MB Envoy buffer, a long LLM response is truncated at the buffer boundary; setting the buffer to 50 MB or using streaming passthrough eliminates the truncation.
Envoy's default 1 MB buffer truncates LLM responses that exceed it; setting responseBuffer.maxBytes to 50 Mi (or enabling streaming passthrough) prevents silent truncation.

50.10 The decision matrix

Use case → Recommended gateway

K8s-native, production-grade default → Envoy AI Gateway
Multi-provider hybrid (OpenAI + self-hosted), Python-first → LiteLLM
Enterprise with Kong → Kong AI Gateway
Cloudflare-based stack → Cloudflare Workers AI Gateway
Simple vLLM-only deployment → vLLM Router (built-in)
Custom requirements → Build your own in FastAPI / Go

For most production K8s LLM platforms in 2025: Envoy AI Gateway. It’s the modern default, integrates with the rest of the K8s ecosystem, and is well-supported by the Envoy community.

50.11 The mental model

Eight points to take into Chapter 51:

  1. An AI gateway is the single API entry point for your LLM platform.
  2. OpenAI-compatible API is the lingua franca. Every gateway and backend speaks it.
  3. AI gateways add LLM-specific features on top of general API gateways: model routing, streaming, token counting, large buffers.
  4. Envoy AI Gateway is the K8s-native default.
  5. LiteLLM is the Python-first multi-provider option.
  6. Routing strategies include model-name, weight-based canary, per-tenant, cost-based.
  7. Authentication, rate limiting, observability all happen at the gateway. Token-based rate limiting is the right unit.
  8. Set explicit buffer limits. Default Envoy buffers are too small for LLM responses.

In Chapter 51 we look at the autoscaling layer in detail: KEDA for GPU inference.


Read it yourself

  • The Envoy AI Gateway documentation.
  • The Envoy Gateway documentation (the underlying L7 proxy).
  • The LiteLLM documentation and GitHub repo.
  • The OpenAI API reference (the standard your gateway implements).
  • The Kubernetes Gateway API spec.

Practice

  1. Why does the OpenAI-compatible API matter so much for the LLM ecosystem? List three benefits.
  2. Sketch a multi-model routing configuration: route gpt-4o to OpenAI, llama-3-70b to a local vLLM, and bge-large-en to a local TEI.
  3. Why is token-based rate limiting the right unit for LLM serving? Compare to request-based rate limiting.
  4. The default Envoy buffer is 1 MB. Why is this too small for LLM responses? Construct an example.
  5. When would you use LiteLLM over Envoy AI Gateway? Identify a specific use case.
  6. Why does the gateway terminate authentication instead of letting each backend handle it?
  7. Stretch: Set up Envoy AI Gateway locally with two vLLM backends. Configure routing by model name. Verify that requests are routed correctly.