The serving framework landscape: vLLM, SGLang, TensorRT-LLM, TGI, llama.cpp, MLC, Triton
"Why vLLM and not SGLang? — every senior LLM systems interview, eventually"
We open Stage 4 of Part III — production model serving — with a comparative tour of the major LLM serving frameworks. By the end of Part III you’ll know how to deploy any of these in production, what each is best at, and which one you’d reach for in any given scenario. This chapter is the map.
The frameworks differ in their kernel quality, scheduling approach, hardware support, operational complexity, and feature surface. There is no single “best” framework — each occupies a niche. The skill is matching the framework to the workload.
Outline:
- The serving stack landscape.
- vLLM — the open-source default.
- SGLang — the SGL programming model and RadixAttention.
- TensorRT-LLM — NVIDIA’s official framework.
- TGI — Hugging Face’s framework.
- llama.cpp — the consumer / edge default.
- MLC-LLM — TVM-based portability.
- Triton Inference Server — the orchestration layer.
- The decision matrix.
44.1 The serving stack landscape
A production LLM serving stack has a few layers:
- The runtime — the engine that runs the model. vLLM, SGLang, TensorRT-LLM, llama.cpp, etc.
- The orchestration — how multiple model replicas, multiple models, and multiple GPUs are managed. KServe, BentoML, Triton Inference Server.
- The gateway — the API layer that routes requests to the right runtime. Envoy AI Gateway, custom proxies.
- The infra — Kubernetes, autoscalers, observability, etc.
This chapter covers the runtimes. Chapter 45 covers orchestration. Chapter 50 covers gateways.
The runtimes differ in their internal architecture, kernel choices, scheduling, and supported hardware. Picking one is a foundational decision because the runtime determines:
- Which models you can serve.
- What throughput you can achieve.
- What the operational complexity looks like.
- What the latency profile is.
- Which optimizations you get for free.
Switching runtimes after the fact is expensive. Pick wisely.
44.2 vLLM — the open-source default
vLLM (Kwon et al., 2023) is the most widely used open-source LLM serving framework as of late 2025. It was the framework that introduced PagedAttention (Chapter 24) and continuous batching (Chapter 23), and it’s been the de facto standard ever since.
The strengths:
- PagedAttention + continuous batching. The original implementation, mature and well-tested.
- Wide model support. Llama, Mistral, Qwen, DeepSeek, Phi, Gemma, and dozens of other architectures all work out of the box.
- Wide hardware support. NVIDIA (Volta, Turing, Ampere, Hopper), AMD (MI200, MI300), some Intel GPUs, AWS Inferentia, TPU support in development.
- Multi-tenant LoRA serving. First-class support for serving multiple LoRA adapters on the same base.
- Speculative decoding support.
- Quantization support. AWQ, GPTQ, FP8, INT8, and various combinations.
- Prefix caching (Chapter 29).
- Disaggregated prefill/decode (Chapter 36) since v0.5.
- Open source. Apache 2.0, active community, frequent releases.
The weaknesses:
- Not always the fastest. TensorRT-LLM is often faster on NVIDIA hardware in the well-tuned case.
- Operational complexity. Many configuration knobs; tuning is its own skill.
- Newer features can be unstable. New releases sometimes regress on edge cases.
- Memory tuning is fragile. Getting the GPU memory utilization and KV cache size right is harder than it should be.
vLLM is the right default for almost everyone. Unless you have a specific reason to pick something else, start with vLLM. Most production LLM deployments in 2024-25 use it.
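As a concrete starting point, a typical vLLM launch looks roughly like this. The model name is a placeholder, and the flags shown are the commonly tuned ones from vLLM’s OpenAI-compatible server; exact flag names and defaults vary across vLLM versions, so treat this as an illustrative sketch rather than a recipe:

```shell
# Illustrative vLLM launch of an OpenAI-compatible server.
# Model name is a placeholder; check your vLLM version's docs for exact flags.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --enable-prefix-caching
```

The `--gpu-memory-utilization` and `--max-model-len` knobs are exactly the fragile memory-tuning surface noted above: together they determine how much HBM is left for the KV cache.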
44.3 SGLang — the SGL programming model and RadixAttention
SGLang (Zheng et al., 2023) is a serving framework with two key ideas:
- The SGLang programming model, which lets you express complex multi-step LLM programs (with branching, loops, parallel calls) as Python code that the runtime can optimize across calls.
- RadixAttention (Chapter 29), an efficient prefix cache that reuses KV cache across requests at the radix-tree level.
The strengths:
- RadixAttention is faster than vLLM’s prefix cache for workloads with deeply nested shared prefixes.
- The SGLang programming model lets you write multi-step LLM workflows with automatic optimization. Useful for agent loops, complex prompt chains, structured generation pipelines.
- Fast attention kernels for Hopper.
- Strong support for structured generation (Chapter 43).
- Active development with frequent improvements.
The weaknesses:
- Smaller community than vLLM. Less battle-tested in production.
- Fewer model architectures supported (catching up).
- The SGLang programming model is opinionated; you may not need it.
- Less mature operational tooling.
SGLang is the right choice when:
- You have heavy prefix sharing (e.g., RAG with long shared contexts).
- You’re writing complex multi-step LLM programs and want them automatically optimized.
- You’re doing structured generation extensively.
- You’re comfortable adopting a less-mature alternative for the performance benefits.
For most production use, SGLang is competitive with vLLM and gaining ground. Some teams have switched to SGLang for prefix-cache-heavy workloads.
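The core idea behind RadixAttention can be sketched in a few lines: store previously seen token sequences in a radix (prefix) tree, and for each new request walk the tree to find how many leading tokens already have KV cache entries. The real system works on token blocks with reference counting and eviction; this toy version (names are my own, not SGLang’s API) just measures reusable prefix length:

```python
# Toy sketch of the RadixAttention idea: a prefix tree over token IDs.
# Real SGLang stores KV cache blocks at tree nodes and evicts under pressure;
# here we only compute how long a cached prefix a new request can reuse.
class RadixNode:
    def __init__(self):
        self.children = {}  # token ID -> RadixNode

class PrefixCache:
    def __init__(self):
        self.root = RadixNode()

    def insert(self, tokens):
        """Record a served request's token sequence in the tree."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())

    def match_len(self, tokens):
        """Length of the longest cached prefix of `tokens`."""
        node, n = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            n += 1
        return n

cache = PrefixCache()
cache.insert([1, 2, 3, 4, 5])         # tokens of an earlier request
print(cache.match_len([1, 2, 3, 9]))  # 3 leading tokens of KV cache reusable
```

Workloads with deeply shared prefixes (shared system prompts, RAG contexts, branching agent calls) are exactly those where `match_len` is large relative to the request length, which is why SGLang wins there.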
44.4 TensorRT-LLM — NVIDIA’s official framework
TensorRT-LLM is NVIDIA’s official LLM inference framework, built on top of TensorRT (NVIDIA’s general-purpose inference compiler). It’s developed by NVIDIA’s engineering team and gets first access to new hardware features.
The strengths:
- Often the fastest on NVIDIA hardware. NVIDIA’s kernel team has the most resources.
- First access to new hardware features. Hopper FP8, Blackwell features, etc. show up here first.
- CUDA Graphs for low-overhead launches.
- Aggressive kernel fusion via TensorRT’s compiler.
- Production-grade. Used by many of NVIDIA’s biggest customers.
The weaknesses:
- NVIDIA-only. No AMD, no Intel, nothing else.
- Closed-source compiler. Some parts are open, but the core compiler is proprietary.
- Operational complexity. TensorRT-LLM requires building per-model engines (an offline compile step) before serving. The engines are tied to the GPU type and the configuration.
- Less flexible than vLLM. Adding new features or supporting new architectures takes longer.
- Feature lag relative to vLLM. Multi-LoRA, continuous batching, and other features arrived in vLLM first; TensorRT-LLM caught up later.
TensorRT-LLM is the right choice when:
- You’re committed to NVIDIA hardware.
- You need the absolute maximum performance.
- You have the operational sophistication to manage the engine compilation pipeline.
- You’re using H100/H200/B200 and want first-class fp8 support.
For most teams, the performance gain over vLLM (typically 10-30%) doesn’t justify the operational overhead. For frontier labs and large-scale NVIDIA deployments, TensorRT-LLM is often the right choice.
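The engine-compilation pipeline mentioned above looks roughly like the following. Script and command names follow the TensorRT-LLM repository (each model family has its own `convert_checkpoint.py` in the examples directory); the paths are placeholders and exact flags vary by version:

```shell
# Illustrative TensorRT-LLM workflow: convert a checkpoint, then compile
# an engine. Paths are placeholders; flags vary by TensorRT-LLM version.
python convert_checkpoint.py --model_dir ./hf-model \
    --output_dir ./ckpt --dtype float16
trtllm-build --checkpoint_dir ./ckpt --output_dir ./engine
# The resulting engine is tied to this GPU architecture and build
# configuration; changing either means rebuilding.
```

This offline compile step is the operational cost you pay for the compiler’s aggressive fusion: every (model, GPU type, configuration) triple needs its own engine build in your deployment pipeline.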
44.5 TGI — Hugging Face’s framework
Text Generation Inference (TGI) is Hugging Face’s open-source serving framework. It was an early competitor to vLLM and is still actively maintained.
The strengths:
- Tight Hugging Face integration. Models from the HF Hub work out of the box.
- Good operational tooling. Built by Hugging Face’s serving team for production use.
- Continuous batching (after vLLM popularized it).
- Multi-tenant support.
- Stable and well-documented.
The weaknesses:
- Generally slower than vLLM for the same workload.
- Smaller community for new features and contributions.
- Less aggressive optimization. Not the framework you reach for if you want the cutting edge.
TGI is a reasonable choice when:
- You’re already deeply invested in the Hugging Face ecosystem.
- You want stable, conservative tooling.
- You don’t need the absolute fastest performance.
For most teams, vLLM or SGLang have overtaken TGI as the open-source default. TGI is still in use but losing market share.
44.6 llama.cpp — the consumer / edge default
llama.cpp (Gerganov, 2023) is a from-scratch C++ inference engine, originally for CPUs but now with GPU support. It uses a custom quantization format (.gguf) and is the dominant choice for consumer / local LLM inference.
The strengths:
- Runs on anything. CPU, CUDA, Metal (Apple), Vulkan, OpenCL, HIP (AMD).
- Excellent on Apple Silicon. The Metal backend is highly optimized for M1/M2/M3/M4.
- Custom quantization formats. GGUF supports a wide variety of quantization schemes (Q4_K_M, Q5_K_S, Q8_0, etc.) with very compact storage.
- No framework dependencies. No PyTorch, no Python. Just C++ and the model weights.
- Easy to embed. Can be linked into other applications as a library.
- Active community. Used by Ollama, LM Studio, llama-cpp-python, and many other tools.
The weaknesses:
- Slower than vLLM on NVIDIA. llama.cpp isn’t designed for maximum throughput on datacenter GPUs.
- Limited multi-tenant support. Designed for single-user / single-request scenarios.
- Limited advanced features. Prefix caching and speculative decoding exist only in early form.
- Custom quantization formats mean you have to convert HF models to GGUF.
llama.cpp is the right choice for:
- Consumer / local LLM serving. Ollama and LM Studio both use it under the hood.
- Edge deployment. Running on phones, laptops, embedded devices.
- Minimal-dependency deployments. When you can’t install PyTorch.
- Mac users. The Metal backend is the best LLM inference on Apple Silicon.
For datacenter LLM serving, llama.cpp is the wrong tool. For everywhere else, it’s often the best.
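The conversion step noted above, plus a local run, looks roughly like this. Script and binary names follow the llama.cpp repository (`convert_hf_to_gguf.py`, `llama-quantize`, `llama-cli`); paths and the quantization choice are placeholders, and names have shifted across llama.cpp versions:

```shell
# Illustrative llama.cpp workflow: HF checkpoint -> GGUF -> quantize -> run.
# Paths are placeholders; binary names follow recent llama.cpp releases.
python convert_hf_to_gguf.py ./hf-model --outfile model-f16.gguf
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
./llama-cli -m model-q4_k_m.gguf \
    -p "Explain KV caching in one sentence." -n 128
```

Tools like Ollama wrap exactly this pipeline: they fetch pre-converted GGUF files and run them through the same engine.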
44.7 MLC-LLM — TVM-based portability
MLC-LLM (MLC AI team) is a serving framework built on TVM (Chapter 38). The pitch: write the model once, compile it for any hardware target.
The strengths:
- Portable. Same model code targets NVIDIA GPUs, AMD GPUs, Intel GPUs, mobile, web (WebGPU), etc.
- Mobile support. Runs LLMs on iPhones and Android devices natively.
- Browser support. WebLLM uses MLC-LLM to run models in the browser via WebGPU.
- Compilation pipeline. TVM optimizes the model for the target hardware.
The weaknesses:
- Slower than vLLM on NVIDIA. TVM-generated kernels are slower than hand-tuned CUTLASS kernels.
- Smaller ecosystem. Less battle-tested than vLLM.
- Compilation step is required. You can’t just load weights and serve; you have to compile first.
MLC-LLM is the right choice for:
- Mobile / web LLM deployment.
- Cross-platform (NVIDIA + AMD + others) deployments.
- When portability matters more than peak performance.
For pure datacenter NVIDIA serving, MLC-LLM doesn’t compete. For multi-platform or mobile/web, it’s often the only option.
44.8 Triton Inference Server — the orchestration layer
A name confusion: Triton Inference Server (NVIDIA) is completely different from Triton the GPU kernel DSL (originally an OpenAI project, now community-governed). The kernel Triton lives at triton-lang.org; Triton Inference Server lives at github.com/triton-inference-server/server.
Triton Inference Server is a general-purpose inference server. It’s not LLM-specific:
- Supports many backends: TensorRT, TensorRT-LLM, vLLM, ONNX, PyTorch, TensorFlow, Python custom backends.
- Provides a unified HTTP/gRPC API for all of them.
- Handles model versioning, multi-model serving, dynamic batching (for non-LLM models), and metrics.
- Used heavily in NVIDIA’s ecosystem and in MLOps platforms.
Triton Inference Server is the right choice when:
- You need to serve LLMs alongside other ML models (classifiers, embedders, etc.) under a unified API.
- You’re already using NVIDIA’s MLOps stack.
- You want a stable, mature orchestration layer.
For pure LLM serving, you can use vLLM directly without Triton Inference Server. For mixed workloads or large MLOps deployments, Triton Inference Server is a reasonable orchestration choice.
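As a sketch, a Triton model repository entry for the vLLM backend looks roughly like this. The layout and field names follow Triton’s vLLM backend conventions, but treat the specifics (model name, exact `model.json` fields) as assumptions to verify against the backend’s docs:

```
model_repository/my-llm/config.pbtxt:
    backend: "vllm"

model_repository/my-llm/1/model.json:
    {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "gpu_memory_utilization": 0.9
    }
```

The pattern generalizes: each model in the repository declares its backend in `config.pbtxt`, so an LLM served through vLLM and a classifier served through ONNX can sit side by side behind the same HTTP/gRPC API.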
44.9 The decision matrix
Putting it all together. The framework decision matrix for late 2025:
| Use case | Recommended framework |
|---|---|
| Datacenter LLM serving on NVIDIA, default | vLLM |
| Heavy prefix sharing or multi-step LLM programs | SGLang |
| Maximum performance on NVIDIA, willing to invest | TensorRT-LLM |
| Hugging Face ecosystem, conservative tooling | TGI |
| Consumer / local / mobile / Apple Silicon | llama.cpp |
| Web / browser deployment | MLC-LLM (via WebLLM) |
| Multi-platform (NVIDIA + AMD + …) | MLC-LLM or vLLM (depending on what you need) |
| Mixed ML workloads (LLMs + classifiers) | Triton Inference Server with vLLM backend |
| AMD MI300 deployment | vLLM (best AMD support) |
| Hopper FP8 | TensorRT-LLM or vLLM (both support it) |
| Long context with shared prefixes | SGLang |
| Latency-critical single-user | TensorRT-LLM |
| Throughput-critical multi-user | vLLM |
For most readers, the answer is vLLM, with SGLang as the alternative for prefix-cache-heavy workloads and TensorRT-LLM for the maximum-performance NVIDIA case.
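The decision logic reduces to a lookup with a safe default. This toy helper paraphrases the matrix above (the workload tags are my own labels, not any framework’s API) just to make the "default to vLLM" structure explicit:

```python
# Toy lookup mirroring the chapter's decision matrix. Workload tags are
# illustrative labels; the recommendations paraphrase the table above.
RECOMMENDATIONS = {
    "datacenter-default": "vLLM",
    "prefix-heavy": "SGLang",
    "max-nvidia-perf": "TensorRT-LLM",
    "hf-ecosystem": "TGI",
    "consumer-local": "llama.cpp",
    "web-browser": "MLC-LLM",
    "mixed-ml": "Triton Inference Server (vLLM backend)",
}

def recommend(workload: str) -> str:
    """Return the chapter's recommendation, defaulting to vLLM."""
    return RECOMMENDATIONS.get(workload, "vLLM")

print(recommend("prefix-heavy"))       # SGLang
print(recommend("something-unusual"))  # vLLM, the safe default
```

The interesting part is the default branch: when the workload doesn’t clearly match a niche, the answer falls through to vLLM.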
44.10 Why the answer is usually vLLM
The honest assessment: vLLM has become the default because it’s good enough at everything.
- The performance is competitive with (though not always beating) TensorRT-LLM.
- The model support is the widest in the open-source community.
- The hardware support is broad (NVIDIA, AMD, Intel, AWS).
- The community is the largest, which means features arrive fast and bugs get fixed.
- The operational story is the most documented.
- The defaults are good enough to deploy without deep tuning.
vLLM has won the open-source LLM serving race not because it’s the best at any single thing, but because it’s the best overall. The “standardize on vLLM” decision is the safe one for most teams.
The exceptions are when you have a specific need that vLLM doesn’t handle well — heavy prefix caching (SGLang), maximum NVIDIA performance (TensorRT-LLM), local/consumer (llama.cpp), web/mobile (MLC-LLM). In those cases, the alternative is the right call.
For most production LLM deployments at most companies, just use vLLM. Tune it carefully (Chapter 48), monitor it well (Chapter 53), and call it done.
44.11 The mental model
Eight points to take into Chapter 45:
- vLLM is the default. Use it unless you have a specific reason not to.
- SGLang wins on prefix-cache-heavy workloads and structured generation.
- TensorRT-LLM is the maximum-performance NVIDIA choice. Operational complexity is real.
- TGI is stable and HF-integrated but losing ground to vLLM.
- llama.cpp dominates consumer / local / mobile / Apple.
- MLC-LLM for portable / web / mobile.
- Triton Inference Server for orchestration of mixed ML workloads.
- The decision matrix is clear once you know the workload. Match the framework to the workload, not the workload to the framework.
In Chapter 45 we look at the orchestration layer above the runtime: KServe, BentoML, and the rest.
Read it yourself
- The vLLM GitHub repository and documentation.
- The SGLang paper (Zheng et al., 2023) and the GitHub repository.
- The TensorRT-LLM GitHub repository and documentation.
- The TGI GitHub repository.
- The llama.cpp GitHub repository.
- The MLC-LLM documentation and the WebLLM demos.
- The Triton Inference Server documentation.
Practice
- Pick three LLM serving frameworks from this chapter and identify the workloads where each wins. Justify in one sentence each.
- Why has vLLM become the default? List three reasons.
- When would you choose SGLang over vLLM? Construct a specific use case.
- Why is TensorRT-LLM faster than vLLM in some cases but harder to deploy?
- Why does llama.cpp dominate the consumer/local LLM market while being uncompetitive in datacenter serving?
- What’s the difference between Triton (the kernel DSL) and Triton Inference Server (the framework)? Why is the naming confusing?
- Stretch: Set up vLLM and SGLang on the same hardware with the same model. Run a benchmark with shared-prefix prompts and compare prefix cache hit rates.