Part III · Inference Internals & Production Serving
Chapter 49 · ~14 min read

TEI for embeddings and rerankers in production

"vLLM is for the generative side. TEI is for the discriminative side. Use the right tool"

In Chapter 9 we covered embedding models and rerankers — the encoder-only side of the transformer family that powers retrieval, search, and reranking. In Chapter 49 (this one) we cover the production runtime for serving them: Text Embeddings Inference (TEI).

The pitch from earlier chapters: embedding models and rerankers have a completely different operational profile from generative LLMs, and vLLM is the wrong tool for serving them. Their inputs are short and often fixed-length, their outputs are vectors rather than tokens, they have no KV cache to manage, and their batching strategy is different. TEI is the runtime designed specifically for this workload.

This chapter is the production reference for TEI. By the end you’ll know how to deploy and tune it for embedding and reranking workloads at scale.

Outline:

  1. Why TEI is its own runtime.
  2. The TEI architecture.
  3. Supported model types.
  4. Serving embedding models.
  5. Serving rerankers.
  6. The configuration flags.
  7. Multimodal embedding (Qwen-VL embedders, etc.).
  8. Autoscaling on te_queue_size.
  9. Operational considerations.

49.1 Why TEI is its own runtime

Recall the differences between generative and discriminative serving:

Property | Generative LLM (vLLM) | Embedding model (TEI)
Model architecture | Decoder-only | Encoder-only
Output | Stream of tokens | Single fixed-shape vector
KV cache | Yes, large, dominates memory | None
Batching | Continuous, token-level | Static, request-level
Sequence length | Variable, can be huge | Short, often fixed (<8k tokens)
Optimization target | Throughput + TPOT | Latency + throughput
Concurrency model | Many in-flight at varying positions | All requests in a batch finish at once

These differences mean that the optimizations vLLM is built for (PagedAttention, continuous batching, KV cache management) don’t apply to embedding models. There’s no KV cache to page; there’s no autoregressive loop to batch incrementally; there’s no sequence-level variability to schedule around.

[Figure: vLLM vs TEI request flow. vLLM runs an autoregressive decode loop — prompt → prefill → token-by-token decode, with a KV cache that grows per output token, emitting a token stream. TEI collects a batch during a short wait window, runs a single encoder forward pass, and returns all fixed-shape vectors at once — no KV cache needed.]

What embedding models need:

  • Fast batched encoder forward passes. Take a batch of input texts, run them through the encoder, return the embeddings.
  • Efficient batching. Group incoming requests into batches up to a max size or max wait time.
  • Tokenizer integration. Tokenize inputs efficiently before forward.
  • Model quantization for memory and throughput.
  • Multi-model serving (sometimes one server holds embedder + reranker).

This is a fundamentally different runtime from vLLM. The Hugging Face team built Text Embeddings Inference (TEI) specifically for this workload, in Rust, with a focus on minimal overhead and fast cold starts.

49.2 The TEI architecture

TEI is a single-binary Rust application. The architecture:

  1. HTTP server that accepts embedding requests via a REST API.
  2. Tokenizer (using the tokenizers Rust library) that batches and tokenizes inputs.
  3. Inference engine that runs the model on GPU (via candle, the Rust ML library) or CPU.
  4. Batch scheduler that groups incoming requests into batches.
  5. Output formatter that returns vectors in the API response.

The key design choices:

  • Rust + candle instead of Python + PyTorch. Faster startup, smaller memory footprint, better latency.
  • Static batching with timeouts. TEI collects incoming requests for a brief window (a few ms), batches them, runs one forward pass, returns results.
  • Custom kernels for the most common architectures. TEI has hand-tuned kernels for BERT, JinaBERT, BGE, and other common embedding architectures.
  • OpenAI-compatible API. TEI exposes its native /embed and /rerank endpoints, plus /v1/embeddings, which matches OpenAI’s Embeddings API.

The result: TEI is significantly faster than running the same model in PyTorch for embedding workloads. Startup is faster (Rust + candle vs Python + PyTorch), per-request overhead is lower, and throughput is higher on the same hardware.
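The collect-then-batch behavior is worth internalizing. Here is a toy Python model of it — TEI's real scheduler is asynchronous Rust code, and the class and parameter names here are purely illustrative:

```python
import time
from dataclasses import dataclass, field


@dataclass
class StaticBatcher:
    """Toy model of TEI-style static batching: collect requests until the
    batch is full or a short wait window expires, then run them together."""
    max_batch_size: int = 32
    max_wait_ms: float = 5.0
    queue: list = field(default_factory=list)

    def submit(self, text: str) -> None:
        self.queue.append(text)

    def next_batch(self) -> list:
        deadline = time.monotonic() + self.max_wait_ms / 1000
        # TEI waits asynchronously; a sleep loop keeps the toy simple.
        while len(self.queue) < self.max_batch_size and time.monotonic() < deadline:
            time.sleep(0.0005)
        batch, self.queue = self.queue[:self.max_batch_size], self.queue[self.max_batch_size:]
        return batch


sched = StaticBatcher(max_batch_size=2, max_wait_ms=1.0)
for text in ["Hello, world!", "How are you?", "A third request"]:
    sched.submit(text)
batch = sched.next_batch()   # batch already full: returns immediately
print(batch)                 # ['Hello, world!', 'How are you?']
```

The two tuning knobs map directly onto the tradeoff: a larger batch size raises throughput, a longer wait window raises tail latency.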

49.3 Supported model types

TEI supports several categories of models:

Embedding models (bi-encoders). The largest category. Most of the popular embedders work:

  • BGE family (bge-small-en, bge-base-en, bge-large-en, bge-m3).
  • E5 family (e5-small-v2, e5-large-v2, e5-mistral-7b-instruct).
  • GTE family.
  • Jina embeddings.
  • Sentence-transformers compatible models.
  • Nomic Embed.

Cross-encoder rerankers. The second category:

  • BGE rerankers (bge-reranker-base, bge-reranker-large, bge-reranker-v2-m3).
  • Jina reranker.
  • mxbai-rerank.

Sequence classification models. Bonus category — models that produce a class label or score for an input. Useful for classification, sentiment analysis, content moderation.

Multimodal embedders. Newer support for image and text embedders (CLIP-like models, Qwen-VL embedders, etc.).

The list grows over time as the TEI team adds support for new architectures. As of late 2025, almost any popular open embedder works with TEI.

49.4 Serving embedding models

A minimal TEI deployment for an embedding model:

docker run --gpus all -p 8080:80 \
  -e MODEL_ID=BAAI/bge-large-en-v1.5 \
  ghcr.io/huggingface/text-embeddings-inference:1.5

That’s it. Pull the image, set the model ID, and map a host port to the container’s port 80. TEI handles the rest: downloading the model, tokenizing inputs, running batched forward passes, returning embeddings.

The native /embed endpoint takes a batch of inputs:

curl http://localhost:8080/embed \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{"inputs": ["Hello, world!", "How are you?"]}'

Returns a JSON array of embedding vectors:

[[0.123, -0.456, ...], [0.234, -0.567, ...]]

For OpenAI compatibility:

curl http://localhost:8080/v1/embeddings \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bge-large-en-v1.5",
    "input": ["Hello, world!", "How are you?"]
  }'

This matches the OpenAI Embeddings API exactly, so you can use OpenAI client libraries (Python, JS) to talk to TEI without modification.
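What you do with the returned vectors is up to the client; the standard move is cosine similarity. A minimal sketch in pure Python, with made-up three-dimensional vectors standing in for the 384-1024-dimension vectors real embedders return:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))


# Tiny stand-ins for vectors returned by TEI's /embed endpoint.
query_vec = [0.1, 0.9, 0.2]
doc_vecs = {"doc_a": [0.1, 0.8, 0.3], "doc_b": [0.9, 0.1, 0.0]}

ranked = sorted(doc_vecs, key=lambda d: cosine_similarity(query_vec, doc_vecs[d]),
                reverse=True)
print(ranked)  # ['doc_a', 'doc_b'] -- doc_a points closer to the query
```

In production this ranking step is what your vector database does at scale; the math is the same.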

49.5 Serving rerankers

For a cross-encoder reranker:

docker run --gpus all -p 8080:80 \
  -e MODEL_ID=BAAI/bge-reranker-large \
  ghcr.io/huggingface/text-embeddings-inference:1.5

The API for reranking is slightly different:

curl http://localhost:8080/rerank \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is machine learning?",
    "texts": [
      "ML is a field of computer science.",
      "Pizza is a popular food.",
      "ML uses statistics to learn from data."
    ]
  }'

Returns a list of (index, score) pairs sorted by descending score:

[
  {"index": 2, "score": 0.95},
  {"index": 0, "score": 0.85},
  {"index": 1, "score": 0.05}
]

The reranker scores each (query, text) pair with one cross-encoder forward pass, then returns the sorted list.

For a typical retrieval pipeline (Chapter 9):

  1. Bi-encoder embeds the query and finds top-100 candidates from a vector database.
  2. Cross-encoder reranks the top-100 to find the top-10.
  3. Top-10 is fed to the LLM as context.

You’d run two TEI instances: one for the bi-encoder, one for the reranker. They’re separate processes but use the same TEI binary.
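The orchestration between the two instances is a few lines of glue. In the sketch below, tei_embed and tei_rerank are hypothetical stand-ins for the HTTP calls to the two TEI instances (POST /embed and POST /rerank); they use toy in-memory scoring here so the pipeline logic itself is runnable as-is:

```python
import math

CORPUS = [
    "ML is a field of computer science.",
    "Pizza is a popular food.",
    "ML uses statistics to learn from data.",
]


def tei_embed(texts):
    # Stand-in for POST /embed on the bi-encoder TEI: hash words into 8 dims.
    vecs = []
    for t in texts:
        v = [0.0] * 8
        for w in t.lower().split():
            v[sum(map(ord, w)) % 8] += 1.0
        vecs.append(v)
    return vecs


def tei_rerank(query, texts):
    # Stand-in for POST /rerank on the cross-encoder TEI: word-overlap score.
    qwords = set(query.lower().rstrip("?").split())
    scored = [{"index": i, "score": len(qwords & set(t.lower().split())) / len(qwords)}
              for i, t in enumerate(texts)]
    return sorted(scored, key=lambda r: r["score"], reverse=True)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))) or 1.0)


def retrieve_then_rerank(query, first_stage_k, final_k):
    qvec = tei_embed([query])[0]
    doc_vecs = tei_embed(CORPUS)
    # Stage 1: bi-encoder similarity narrows the corpus to first_stage_k candidates.
    candidates = sorted(range(len(CORPUS)),
                        key=lambda i: cosine(qvec, doc_vecs[i]), reverse=True)[:first_stage_k]
    # Stage 2: cross-encoder rescores the candidates; keep the top final_k.
    reranked = tei_rerank(query, [CORPUS[i] for i in candidates])
    return [CORPUS[candidates[r["index"]]] for r in reranked[:final_k]]


print(retrieve_then_rerank("What is ML?", first_stage_k=3, final_k=1))
```

Swap the two stand-ins for real HTTP calls and this is the whole pipeline: the structure (embed, narrow, rerank, truncate) doesn't change.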

[Figure: Retrieve-then-rerank pipeline. Query → bi-encoder TEI (embed + ANN, top-100) → cross-encoder TEI reranker (top-10) → LLM (vLLM) generation → answer. The reranker's top-10 list is the only context fed to the expensive LLM, saving compute and improving relevance.]

49.6 The configuration flags

TEI has fewer flags than vLLM because the workload is simpler. The important ones:

--model-id. The model to load. HuggingFace ID or local path.

--revision. Pin to a specific git revision. Use in production.

--max-concurrent-requests. The maximum number of in-flight requests. Default 512.

--max-batch-tokens. The maximum number of tokens in a single batch (sum across requests). Default 16384. Tune for throughput vs latency.

--max-batch-requests. The maximum number of requests per batch. Default unlimited. Use to cap batch size for memory or latency.

--max-client-batch-size. The maximum number of inputs per client request. Default 32. Increase if your clients send large batched requests.

--port. HTTP port. Default 80.

--hostname. Bind address. Default 0.0.0.0.

--dtype. The compute dtype. Options: float16, float32. Default depends on the model. Use float16 for speed.

--pooling. Override the pooling strategy. Options: cls, mean, splade, last-token. Default is whatever the model card specifies.

--auto-truncate. Automatically truncate inputs that exceed the model’s max length. Default false. Enable in production to handle long inputs gracefully instead of erroring.

--api-key. Optional API key for authentication. Same caveat as vLLM: don’t rely on this for production security.

--otlp-endpoint. OpenTelemetry trace endpoint.

For most deployments, the defaults are good. Tune --max-batch-tokens for your throughput requirements.
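A back-of-envelope calculation shows why --max-batch-tokens is the throughput knob. The workload numbers below are assumptions — measure the average input size and per-batch forward time on your own traffic and GPU:

```python
# Rough sizing for --max-batch-tokens. All workload numbers are assumptions.
max_batch_tokens = 16_384    # TEI default
avg_input_tokens = 512       # assumption: a typical RAG chunk size
per_batch_ms = 20            # assumption: measured encoder forward time per full batch

requests_per_batch = max_batch_tokens // avg_input_tokens
throughput_ceiling_rps = requests_per_batch * 1000 / per_batch_ms

print(requests_per_batch)       # 32 requests fill one batch
print(throughput_ceiling_rps)   # 1600.0 req/s upper bound per replica
```

Raising the limit lifts the throughput ceiling, but bigger batches also take longer per forward pass, so latency-sensitive deployments may want it lower.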

49.7 Multimodal embedding

A relatively new TEI feature: support for multimodal embedders that take both text and images. The major use case is models like Qwen-VL embedding variants, CLIP-based embedders, and SigLIP-based embedders.

The API is extended to accept images:

curl http://localhost:8080/embed \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": [
      {"type": "image", "url": "https://example.com/image.jpg"},
      {"type": "text", "value": "a cat"}
    ]
  }'

The model encodes both inputs into the same embedding space, allowing cross-modal retrieval (find images similar to a text query, or vice versa).
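Constructing that mixed payload from a client is plain JSON; the field names below simply mirror the example request above:

```python
import json

# Mixed text+image request body, matching the curl example in this section.
payload = {
    "inputs": [
        {"type": "image", "url": "https://example.com/image.jpg"},
        {"type": "text", "value": "a cat"},
    ]
}
body = json.dumps(payload)   # send as the POST body with any HTTP client
print(body)
```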

Multimodal embedders are essential for modern multimodal RAG and search systems. TEI’s support is maturing in 2024-25 and is increasingly used in production.

49.8 Autoscaling on te_queue_size

TEI exposes Prometheus metrics on /metrics, including:

  • te_queue_size — number of requests waiting in the batch queue.
  • te_request_count — total requests served.
  • te_batch_inference_duration — histogram of batch inference times.
  • te_batch_inference_size — histogram of batch sizes.
  • te_request_duration — histogram of end-to-end request durations.

The most important metric for autoscaling is te_queue_size. It directly measures “are we under-provisioned?” — if the queue is growing, you need more replicas.

A typical KEDA ScaledObject for TEI:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: bge-embedder
spec:
  scaleTargetRef:
    name: bge-embedder
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: te_queue_size
        threshold: "10"
        query: te_queue_size{deployment="bge-embedder"}

This says: scale based on the queue size, targeting a queue length of 10 per replica. When the queue grows, KEDA adds replicas. When it shrinks, KEDA removes them.
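Under the hood, KEDA hands the metric to the Kubernetes HPA, which (simplified) sizes the deployment as ceil(metric / threshold), clamped to the replica bounds. A sketch of that math:

```python
import math


def desired_replicas(queue_size, threshold=10, min_replicas=1, max_replicas=20):
    """Simplified HPA math behind the ScaledObject above:
    ceil(metric / threshold), clamped to [minReplicaCount, maxReplicaCount]."""
    return max(min_replicas, min(max_replicas, math.ceil(queue_size / threshold)))


print(desired_replicas(0))    # 1  -- never scales below minReplicaCount
print(desired_replicas(35))   # 4  -- 35 queued / 10 per replica, rounded up
print(desired_replicas(500))  # 20 -- capped at maxReplicaCount
```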

[Figure: KEDA watches te_queue_size via Prometheus and scales the TEI deployment when the queue exceeds the threshold (10 per replica), keeping batch latency bounded as load varies. A growing backlog directly signals under-provisioning — unlike vLLM, where the in-flight request count is the correct signal.]

The threshold (10 in the example) is workload-specific. Tune it based on your latency requirements and the per-batch processing time.

This is fundamentally different from vLLM’s autoscaling metric (vllm:num_requests_running). For embeddings, the queue is the right signal because batches are short and uniform — a backlog directly translates to latency. For LLMs, the in-flight count is the right signal because batches have variable cost.

49.9 Operational considerations

A few production-relevant points:

(1) TEI scales horizontally well. Each TEI replica is independent (no shared state). Adding more replicas linearly increases throughput. Use enough to keep te_queue_size near zero.

(2) Cold starts are fast. Compared to vLLM (which loads multi-GB LLMs), TEI loads small embedders (a few hundred MB). Cold start is typically <30 seconds. This makes scale-to-zero more practical for low-traffic embedders.

(3) GPU is usually the bottleneck. Even small embedders saturate GPU compute at high QPS. For very high-QPS workloads, you may need multiple GPUs (one TEI per GPU) or CPU instances for some traffic.

(4) CPU support exists. TEI can run on CPU with significantly lower throughput. Useful for low-volume scenarios or for cost-optimized deployments where GPU is overkill.

(5) Per-tenant isolation. TEI doesn’t have built-in multi-tenant features. If you need per-tenant rate limiting or quota, do it at the gateway layer.

(6) Monitoring. Watch the queue size, the batch size distribution, and the end-to-end latency. The first tells you if you’re under-provisioned; the second tells you if your batching is working; the third tells you if users are seeing acceptable performance.

(7) Model loading. TEI downloads from HuggingFace Hub by default. For production, mirror the model to your own storage (S3, GCS) and configure TEI to load from there to avoid HF rate limits.

(8) Separate from LLM serving. TEI and vLLM are different services with different resource profiles. Don’t colocate them on the same GPU; run them as separate deployments on their own hardware.

49.10 The mental model

Eight points to take into Chapter 50:

  1. TEI is a separate runtime because embedding models have completely different needs from LLMs.
  2. No KV cache, no continuous batching. Just static batched encoder forward passes.
  3. Rust + candle for low overhead and fast cold starts.
  4. OpenAI-compatible API for both embeddings and reranking.
  5. The autoscaling metric is te_queue_size — fundamentally different from vLLM’s vllm:num_requests_running.
  6. Multimodal support is maturing for image+text embedders.
  7. Scales horizontally well, fast cold starts, GPU-bound at high QPS.
  8. Run TEI and vLLM as separate services. Don’t try to share GPUs.

In Chapter 50 we look at the layer in front of both vLLM and TEI: the AI gateway.


Read it yourself

  • The Text Embeddings Inference GitHub repository (huggingface/text-embeddings-inference).
  • The TEI documentation and API reference.
  • The TEI Docker images on GHCR.
  • The Hugging Face MTEB leaderboard for choosing embedding models.
  • Examples of TEI deployments in KServe via the community.

Practice

  1. Deploy TEI locally with a small embedder (BAAI/bge-small-en-v1.5). Verify the /embed endpoint works.
  2. Why doesn’t TEI support continuous batching like vLLM? Trace the request lifecycle for an embedding request and identify where continuous batching would or wouldn’t help.
  3. Why is te_queue_size the right autoscaling metric for TEI but num_requests_running is the right one for vLLM? Compare the two workloads.
  4. Set up two TEI instances: one for bge-base-en and one for bge-reranker-base. Build a simple retrieve-then-rerank pipeline against a small corpus.
  5. Why are TEI cold starts faster than vLLM cold starts? List three reasons.
  6. Why don’t you typically share GPUs between TEI and vLLM? Argue from the workload differences.
  7. Stretch: Run a load test against TEI with varying concurrency. Plot throughput vs latency. Identify the saturation point.