Part III · Inference Internals & Production Serving
Chapter 26 Deep dive ~22 min read

Quantization: INT8, INT4, FP8, AWQ, GPTQ, SmoothQuant

"Quantization is the production compression technique. Distillation is the academic one."

We’ve covered quantization in a couple of places already: Chapter 13 (mixed precision for training) and Chapter 18 (the compression overview, where we compared it to pruning and distillation). This chapter is the deep dive. By the end you will know:

  • Why quantization is the dominant inference optimization in 2024–25.
  • The four-bit landscape: AWQ, GPTQ, bitsandbytes-NF4, and how they differ.
  • The eight-bit landscape: INT8 with SmoothQuant, FP8, and the H100 hardware story.
  • KV cache quantization, separately from weight quantization.
  • How to pick a scheme for a deployment.
  • The quality-vs-throughput frontier for current models.

Outline:

  1. Why quantize for serving.
  2. The fundamentals: scale, zero-point, symmetric vs asymmetric.
  3. Per-tensor, per-channel, per-group quantization.
  4. Post-training vs quantization-aware training.
  5. The INT8 family — naive, SmoothQuant, LLM.int8().
  6. The FP8 family — e4m3, e5m2, the H100 story.
  7. The INT4 family — AWQ, GPTQ, NF4.
  8. KV cache quantization.
  9. The activation outlier problem.
  10. Picking a scheme.

26.1 Why quantize for serving

Quantization replaces the high-precision data type of a tensor with a lower-precision one. For inference, this gives you three benefits:

(1) Memory. A 70B model in bf16 is 140 GB; in int8 it’s 70 GB; in int4 it’s 35 GB. You can fit more model in less memory, which means smaller hardware budgets for the same model, and more KV cache budget alongside the model weights.

(2) HBM bandwidth. Recall from Chapter 21 that decode is HBM-bandwidth-bound. The minimum decode latency per token is model_size / HBM_bandwidth. Quantizing the weights reduces the model size, which reduces the bandwidth needed per step. A 70B model in int4 has 1/4 the bandwidth requirement of bf16, which means 4× faster decode.

(3) Tensor Core throughput. Modern GPUs have specialized Tensor Cores for low-precision matmul. INT8 throughput on H100 is 2× bf16, and FP8 is likewise 2× bf16. INT4 has no native Tensor Core path on Hopper, but specialized kernels like Marlin dequantize on the fly and still run faster by cutting bandwidth. Quantization isn’t just a memory win; it buys compute throughput too.

The combination of (1), (2), and (3) makes quantization the most cost-effective single optimization for inference.

[Figure: memory and minimum decode latency for Llama 70B on H100 at different precisions. bf16: 140 GB weights, ~47 ms/token minimum latency; fp8: 70 GB, ~23 ms/token (2× faster); int4: 35 GB, ~12 ms/token (4× faster).]
INT4 quantization reduces the minimum decode latency from 47 ms/token to ~12 ms/token on H100 by shrinking the weight tensor the GPU must stream; the entire gain comes from reduced HBM bandwidth pressure.
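These figures follow from simple arithmetic. A quick sketch, assuming a round 3 TB/s of HBM bandwidth (the same number the practice problems use; real H100 SKUs vary):

```python
# Minimum decode latency per token = bytes of weights streamed / HBM bandwidth.
# 3 TB/s is an assumed round number for H100 HBM, matching the text's figures.
HBM_BANDWIDTH = 3.0e12  # bytes/s

def weight_bytes(params: float, bits_per_weight: float) -> float:
    """Size of the weight tensor in bytes at a given precision."""
    return params * bits_per_weight / 8

def min_decode_latency_ms(params: float, bits_per_weight: float) -> float:
    """Bandwidth-bound floor on per-token decode latency, in milliseconds."""
    return weight_bytes(params, bits_per_weight) / HBM_BANDWIDTH * 1e3

for name, bits in [("bf16", 16), ("fp8", 8), ("int8", 8), ("int4", 4)]:
    gb = weight_bytes(70e9, bits) / 1e9
    ms = min_decode_latency_ms(70e9, bits)
    print(f"{name:>5}: {gb:5.0f} GB, {ms:4.1f} ms/token")
```

Running this reproduces the 47 / 23 / 12 ms ladder in the figure; the latency floor scales linearly with bits per weight.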

Compare to alternatives:

  • Pruning (Chapter 18) is harder, gives smaller wins, requires retraining, and doesn’t speed up dense matmul.
  • Distillation (Chapter 18) requires training a separate model, takes weeks, and the quality is unpredictable.
  • Smaller model (use a 7B instead of a 70B) often has unacceptable quality regression.

Quantization is the go-to. The question is just: which scheme, at what precision, with what quality cost?

26.2 Fundamentals

The basic operation: take a floating-point tensor and convert each element to a low-precision integer (or a low-precision float). The mapping is parameterized by a scale (and optionally a zero-point):

quantized_value = round(float_value / scale) + zero_point
float_value     ≈ (quantized_value - zero_point) × scale

The scale is a floating-point number that determines the dynamic range of the quantized representation. A bigger scale lets you represent bigger values but with less precision; a smaller scale gives more precision over a smaller range.

Symmetric quantization uses zero-point = 0 (so positive and negative values are mirrored around zero). The dynamic range is [-127 × scale, 127 × scale] for int8. Symmetric is simpler and has slightly better hardware support; almost all serious quantization is symmetric.

Asymmetric quantization allows a non-zero zero-point. The range is [(0 - zp) × scale, (255 - zp) × scale] for uint8. More flexible for distributions that aren’t centered on zero, but more complex.
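The mapping above can be written out in a few lines. A minimal pure-Python sketch of symmetric per-tensor quantization (values are illustrative):

```python
def quantize_symmetric(x, num_bits=8):
    """Symmetric per-tensor quantization: zero_point = 0, range [-qmax, qmax]."""
    qmax = 2 ** (num_bits - 1) - 1            # 127 for int8
    scale = max(abs(v) for v in x) / qmax     # one scale for the whole tensor
    # round to the nearest integer level and clip to the representable range
    q = [max(-qmax, min(qmax, round(v / scale))) for v in x]
    return q, scale

def dequantize(q, scale):
    """Approximate reconstruction: each value is within scale/2 of the original."""
    return [v * scale for v in q]

x = [0.1, -0.5, 2.54, -1.27]
q, scale = quantize_symmetric(x)
x_hat = dequantize(q, scale)
```

Note the clip step: anything beyond the max-abs calibration value would saturate at ±127, which is exactly the failure mode the outlier discussion in §26.9 is about.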

The dequantization step turns the quantized integer back into an approximate float.

[Figure: quantization number line. Float values map to the nearest integer level on a uniform grid from -127 to +127 (INT8 symmetric); the scale sets the spacing between levels, and values beyond the range clip to the boundary.]
Quantization maps float values to the nearest integer level on a uniform grid; values outside the range clip to the boundary, and the scale determines how many float units each step represents. A smaller scale means finer precision but a smaller representable range.

During inference, the matmul can be done in two ways:

  1. Dequantize-then-multiply. Read the int weights from HBM, dequantize to bf16, do a bf16 matmul. Saves HBM bandwidth (read int instead of bf16) but doesn’t save compute.
  2. Native low-precision matmul. Do the matmul in the low precision directly using Tensor Cores. Saves both HBM bandwidth and compute. Requires hardware support.

INT8, FP8, and INT4 (with the right kernel) all support native low-precision matmul on modern GPUs.

26.3 Per-tensor, per-channel, per-group

The “scale” we introduced above is one number per tensor. This is per-tensor quantization. It’s the simplest, and it has the biggest quality cost because one scale has to fit the entire dynamic range of the tensor — including outliers, which we’ll discuss in §26.9.

The fix is to use multiple scales. The granularity options:

  • Per-tensor. One scale per tensor. Simplest, biggest quality cost.
  • Per-channel. One scale per output channel (i.e., per row of the weight matrix). Captures variation across output dimensions. Standard for INT8 weight quantization.
  • Per-group. One scale per group of weights within a row. Each row is divided into groups of (typically) 64 or 128 elements, each with its own scale. Standard for INT4 quantization (AWQ, GPTQ).
  • Per-element. One scale per weight. Equivalent to floating-point — defeats the purpose.

The trade-off: more scales = better quality (because each scale fits a smaller range and can be more precise) but more storage (you have to store the scales) and more compute (the dequantization needs to look up the right scale for each element).

[Figure: quantization granularity. Per-tensor: one scale for all weights (simplest, least precise). Per-channel: one scale per output row (standard for INT8). Per-group: one scale per 64-128-element window within a row (standard for INT4, AWQ/GPTQ).]
Per-group quantization (group size 64-128) is the production standard for INT4: it gives each window of weights its own scale so outlier groups don't destroy precision for the whole row, at a storage cost of only a few percent for the extra scales.

For INT4, per-group at group size 64 or 128 is essentially the standard. The storage overhead of the scales is small: one fp16 scale per 128 int4 weights is ~3% overhead (per 64 weights, ~6%).
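A hedged sketch of per-group INT4 quantization for a single weight row (toy values, pure Python; real kernels pack two int4 codes per byte):

```python
def quantize_row_per_group(row, group_size=64, num_bits=4):
    """Per-group symmetric quantization: one scale per group_size weights."""
    qmax = 2 ** (num_bits - 1) - 1  # 7 for int4
    q, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        scale = max(abs(v) for v in group) / qmax or 1.0  # avoid div-by-zero
        scales.append(scale)
        q.extend(max(-qmax, min(qmax, round(v / scale))) for v in group)
    return q, scales

def dequantize_row(q, scales, group_size=64):
    """Look up each element's group scale and reconstruct the float value."""
    return [q[i] * scales[i // group_size] for i in range(len(q))]

# 129 weights: two ordinary groups plus a final group holding a large value,
# which gets its own scale and so does not degrade the other groups
row = [0.01 * i for i in range(128)] + [5.0]
q, scales = quantize_row_per_group(row)
recon = dequantize_row(q, scales)
```

Because the 5.0 lands in its own group, its large scale never touches the first 128 weights; with per-tensor scaling it would have coarsened every one of them.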

26.4 Post-training vs quantization-aware training

Post-training quantization (PTQ) takes an already-trained bf16 model and converts it to int8 / int4 / fp8 without further training. You compute the scales from the trained weights (and from a small “calibration set” of representative inputs to capture the activation distributions), apply the quantization, and ship.

PTQ is what you actually want for production. It’s:

  • Cheap. A few hours of compute to quantize a model.
  • Simple. No retraining loop, no hyperparameter tuning, no eval-during-training.
  • Reversible. Keep the bf16 checkpoint as the source of truth and re-quantize when you change schemes.

The cost is some quality loss vs the bf16 baseline. For modern schemes (AWQ, FP8, SmoothQuant), the quality loss is small — typically 0.5–2 percentage points on standard benchmarks.

Quantization-aware training (QAT) retrains the model with simulated quantization in the forward pass. The model “learns” to be robust to the quantization, and the resulting quality is closer to bf16. QAT is more expensive (it requires a fine-tuning run) and more complex (the simulated quantization has to be differentiable).

QAT is sometimes used at the very low precisions (sub-INT4) where PTQ struggles. For INT4 and above, PTQ is usually good enough.

For production, start with PTQ. Move to QAT only if PTQ’s quality cost is unacceptable for your use case.
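The calibration step in PTQ can be as simple as tracking a running absolute maximum over a few representative batches and baking the resulting static scale into the deployment. A toy sketch (the class and values here are illustrative, not from any real library):

```python
class AbsMaxCalibrator:
    """Collects per-tensor absmax statistics over calibration batches,
    then emits a static scale for symmetric quantization."""

    def __init__(self):
        self.absmax = 0.0

    def observe(self, batch):
        # update the running max of |x| over everything seen so far
        self.absmax = max(self.absmax, max(abs(v) for v in batch))

    def scale(self, num_bits=8):
        # static scale: largest observed magnitude maps to the top level
        return self.absmax / (2 ** (num_bits - 1) - 1)

calib = AbsMaxCalibrator()
for batch in ([0.2, -1.5, 0.7], [3.1, -0.4], [-2.2, 0.9]):
    calib.observe(batch)
static_scale = calib.scale()  # 3.1 / 127, frozen at deployment time
```

Production calibrators are fancier (percentile clipping, MSE-optimal scales), but this is the shape of the pipeline: run calibration data through, record statistics, freeze scales, ship.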

26.5 The INT8 family

INT8 quantization was the first wave of LLM compression and is still widely used. The mainstream techniques:

Naive INT8

Per-tensor or per-channel symmetric quantization with a calibration step. Compute the max absolute value of each weight tensor (and each activation tensor based on calibration), use it as the scale, quantize.

Naive INT8 works well for convolutional networks but has problems for transformers. The reason is activation outliers: certain channels in transformer activations have values 10–100× larger than others, and a per-tensor scale based on the max absolute value gives terrible precision for the typical channels. The model loses several points of quality.

LLM.int8() (Dettmers et al., 2022)

The first paper to make INT8 work well for large transformers. The trick: decompose the matmul into two parts. The “outlier channels” (channels with values above a threshold) are kept in fp16; the “inlier channels” are quantized to int8. The two partial matmuls are computed separately and summed.

LLM.int8() preserves quality essentially perfectly (within 0.1% of fp16) but the decomposition adds overhead. It’s used in bitsandbytes for INT8 inference and is the default in HuggingFace transformers when you do load_in_8bit=True.

SmoothQuant (Xiao et al., 2022)

A different fix: instead of decomposing the matmul, smooth the activations before quantization. SmoothQuant observes that the weights are uniform in magnitude but the activations have outliers. It applies a per-channel scaling that shifts magnitude from activations to weights in a way that’s algebraically equivalent. The result is smoother activations and weights that have a slightly wider range — both can now be quantized cleanly to INT8.

SmoothQuant is more efficient than LLM.int8() because it doesn’t need the decomposition. It’s the standard for high-throughput INT8 LLM serving.
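The smoothing trick is a per-input-channel change of variables. A minimal sketch of the algebra (toy numbers; α = 0.5 is the paper's balanced setting):

```python
def smoothing_factors(x_absmax, w_absmax, alpha=0.5):
    """SmoothQuant-style per-channel factors: s_j = max|X_j|^a / max|W_j|^(1-a).
    Activations are divided by s_j, matching weight rows multiplied by s_j,
    so X @ W is unchanged while activation outliers shrink."""
    return [xa ** alpha / wa ** (1 - alpha) for xa, wa in zip(x_absmax, w_absmax)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# toy layer: 3 input channels; activation channel 2 is a 10-100x outlier
X = [[0.5, -0.2, 60.0], [0.1, 0.4, -55.0]]
W = [[0.3, -0.1], [0.2, 0.5], [0.02, -0.03]]  # rows = input channels

x_absmax = [max(abs(row[j]) for row in X) for j in range(3)]
w_absmax = [max(abs(v) for v in row) for row in W]
s = smoothing_factors(x_absmax, w_absmax)

X_s = [[v / s[j] for j, v in enumerate(row)] for row in X]   # smoother activations
W_s = [[v * s[i] for v in row] for i, row in enumerate(W)]   # slightly wider weights
# matmul(X, W) == matmul(X_s, W_s) up to float rounding
```

After smoothing, the outlier channel's activations drop from ~60 to ~1.3 while the corresponding weight row grows; both tensors now quantize cleanly to INT8 with ordinary per-channel scales.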

INT8 in 2025: SmoothQuant is the dominant approach. Quality is essentially at the fp16 baseline; throughput is roughly 2× bf16 on H100; the memory footprint is halved. A solid mid-precision option when you don’t want to go all the way to INT4.

26.6 The FP8 family

FP8 is the new kid on the block, enabled by Hopper (H100) hardware. Recall from Chapter 13 the two FP8 formats:

  • e4m3 (4 exponent bits, 3 mantissa bits): max value 448. Used for activations and weights during inference.
  • e5m2 (5 exponent bits, 2 mantissa bits): max value 57,344. Used for gradients during training.

For inference, you almost always use e4m3 for the weights and activations. e5m2’s much wider range exists for backward-pass gradients and is wasted on inference.

The FP8 quantization story is straightforward:

  1. Compute per-tensor (or per-channel) scales for weights and activations.
  2. Quantize weights to e4m3 (offline, once).
  3. At inference time, quantize activations to e4m3 on the fly using the calibrated scales.
  4. Run matmul in fp8 native (Hopper Tensor Cores have fp8 support).

The quality cost of fp8 is very small, often within 0.1% of fp16/bf16 baseline. The reason: fp8 has more dynamic range than int8 (it has an exponent), so it handles activation outliers gracefully without needing tricks like LLM.int8 decomposition or SmoothQuant smoothing.

The throughput benefit on Hopper:

  • bf16 → fp8: 2× more Tensor Core throughput.
  • bf16 → fp8: 2× less HBM bandwidth.
  • Combined: roughly 2× faster inference for compute-bound and memory-bound regimes alike.

FP8 is the modern default for H100 serving when supported. Frameworks: TensorRT-LLM has first-class FP8 support; vLLM’s FP8 support has matured significantly; SGLang supports it. The caveat is that not every kernel has an FP8 implementation; some operations still happen in bf16 or fp32, with fp8 reserved for the matmuls.

Limitations:

  • Hopper only. Older GPUs (Ampere, V100) don’t have fp8 hardware. You can simulate it but you don’t get the speedup.
  • Calibration matters. The quality is sensitive to the calibration set used to compute the scales.
  • Dynamic vs static scales. Static scales (computed once at deployment) are simpler; dynamic scales (re-computed per batch) are more accurate but more expensive.

For new H100/H200 deployments: try FP8 first. It’s the cleanest option.

26.7 The INT4 family

INT4 is the most aggressive practical quantization for LLMs. It cuts memory and bandwidth to 1/4 of bf16. The quality cost is moderate — typically 1–3 points on standard benchmarks for a 70B model — and the throughput gain is large.

The three main schemes:

AWQ (Activation-aware Weight Quantization, Lin et al., 2023)

The core insight: most of the quality loss in low-bit quantization comes from a small fraction of “important” weights. AWQ identifies these by looking at which weights interact with high-magnitude activations (because errors there matter more), and it uses per-channel scaling to preserve those weights’ precision.

Specifically, AWQ:

  1. Calibrates with a small dataset to find which channels have large activations.
  2. Scales the weights in those important channels up before quantization (so the scale is finer-grained for them) and scales the corresponding activations down (algebraically equivalent).
  3. Quantizes everything to INT4 with per-group scales.

The result is a model where the “important” weights (the ones that drive the most output variance) are preserved at high effective precision, while the rest are aggressively quantized.

AWQ is the most commonly deployed INT4 scheme for LLMs as of 2024–25. It’s supported by vLLM, SGLang, TensorRT-LLM, and others. Quality is typically within 1–2 points of bf16.
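The effect of AWQ-style scaling can be seen in a toy dot product. The sketch below (illustrative numbers, not the real AWQ algorithm, which searches for the scales via calibration) protects a weight whose activation partner is large by scaling it up before INT4 quantization and folding the inverse into the activation:

```python
def quant_dequant(vals, num_bits=4):
    """Round-to-nearest symmetric quantization and immediate reconstruction."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = max(abs(v) for v in vals) / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in vals]

# one weight row; channel 0 multiplies a high-magnitude activation (salient)
w = [0.012, 0.9, -0.7, 0.5]
x = [80.0, 1.0, -1.0, 0.5]

y_ref = sum(wi * xi for wi, xi in zip(w, x))

# plain round-to-nearest INT4: the tiny-but-salient w[0] rounds to zero
y_rtn = sum(wi * xi for wi, xi in zip(quant_dequant(w), x))

# AWQ-style: scale the salient channel's weight up by s, fold 1/s into x
s = 8.0  # illustrative; AWQ picks s per channel from calibration statistics
w_scaled = [w[0] * s] + w[1:]
x_scaled = [x[0] / s] + x[1:]
y_awq = sum(wi * xi for wi, xi in zip(quant_dequant(w_scaled), x_scaled))
```

Without scaling, the 0.012 weight quantizes to zero and its 0.96 contribution to the output vanishes; with scaling, it survives quantization and the output error drops severalfold.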

GPTQ (Frantar et al., 2022)

The other main INT4 scheme. GPTQ uses approximate second-order (Hessian-based) information to quantize a layer’s weights one column at a time, updating the not-yet-quantized weights to compensate for the error introduced. It’s mathematically more sophisticated than AWQ and was the first INT4 scheme to demonstrate near-bf16 quality at the 4-bit level.

GPTQ is a per-layer algorithm — it processes one layer at a time, computing the optimal quantization for that layer given the inputs the layer sees in calibration. The calibration is more expensive than AWQ’s (it requires running the model on calibration data layer by layer).

In practice, AWQ and GPTQ produce similar quality on modern models. AWQ is slightly faster to apply and is the more common production choice; GPTQ has slightly better quality on some models.

NF4 (NormalFloat 4-bit, from QLoRA, Dettmers et al., 2023)

We covered this in Chapter 15 (QLoRA). NF4 is a custom 4-bit format whose discrete values are chosen based on the empirical distribution of pretrained weights (which is approximately normal). NF4 has slightly better quality than uniform INT4 at the same bit width.

NF4 is mostly used for fine-tuning (QLoRA) rather than inference, because the bitsandbytes implementation is slower than dedicated inference kernels. For pure inference, AWQ is generally faster.

Other variants

  • GPTQ + Marlin kernel: GPTQ-quantized weights served with the Marlin INT4 kernel (available in vLLM). Currently one of the fastest INT4 inference paths.
  • EXL2: a more flexible scheme that uses variable bits per layer based on importance. Used in exllamav2 (a community serving library for consumer GPUs).
  • HQQ: half-quadratic quantization. Newer, claims competitive quality at lower calibration cost.

26.8 KV cache quantization

So far we’ve talked about quantizing the weights. The other big tensor in inference is the KV cache (Chapter 22), which can be larger than the weights for long contexts. KV cache quantization is a separate technique with its own characteristics.

The standard approach: quantize the K and V tensors to INT8 (or sometimes INT4) when storing them in the cache. When attention reads them, it dequantizes back to bf16 in SRAM and runs the attention computation in bf16.

The benefits:

  • 2× smaller KV cache with INT8, 4× smaller with INT4.
  • More concurrent users fit on the same GPU.
  • Longer contexts fit in the same KV budget.

The costs:

  • A small amount of quality loss (much less than weight quantization, because the KV cache is downstream of the model and small errors don’t compound as badly).
  • Some compute overhead for dequantization in the attention kernel.

vLLM supports KV cache quantization via the --kv-cache-dtype flag, whose supported values are fp8 variants (e4m3 and e5m2); INT8 and INT4 KV cache options appear in other serving stacks. SGLang has similar support.

The non-obvious thing: KV cache quantization is independent of weight quantization. You can have bf16 weights with INT8 KV cache, or AWQ weights with FP8 KV cache, or any other combination. They’re separate decisions.

For long-context workloads (32k+ tokens), KV cache quantization is often worth the quality cost. For short-context workloads where the cache isn’t dominant, weight quantization is more impactful.
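A toy sketch of the INT8 KV cache idea (pure Python, one K vector per token, one scale per cached token; real kernels do this per head inside the attention kernel):

```python
def quantize_kv_token(vec, num_bits=8):
    """Quantize one token's K (or V) vector with its own scale before caching."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = max(abs(v) for v in vec) / qmax or 1.0  # guard against all-zero vectors
    return [max(-qmax, min(qmax, round(v / scale))) for v in vec], scale

class Int8KVCache:
    """Toy KV cache: stores int8 codes plus one float scale per cached token."""

    def __init__(self):
        self.k_codes, self.k_scales = [], []

    def append(self, k_vec):
        codes, scale = quantize_kv_token(k_vec)
        self.k_codes.append(codes)
        self.k_scales.append(scale)

    def read(self, i):
        # dequantize on the way into the attention computation
        return [c * self.k_scales[i] for c in self.k_codes[i]]

cache = Int8KVCache()
cache.append([0.5, -1.27, 0.03])
k0 = cache.read(0)  # approximately the original vector, within scale/2 per element
```

Nothing here touches the weights, which is the point: the cache's precision is a knob you turn independently of whatever weight scheme you picked.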

26.9 The activation outlier problem

A key technical challenge that runs through all the INT8/INT4 schemes: activation outliers.

In transformer activations, certain dimensions have values that are 10–100× larger than the typical magnitude. For example, after the first few transformer layers, you might have activation values where 99% of the channels are in [-3, 3] but a handful are in [-100, 100].

[Figure: the activation outlier problem. A per-tensor INT8 scale must span the outlier range (+100), so scale = 100/127 and all 127 steps cover the outlier's range; typical values of ±3 then land on only about 4 levels per side. SmoothQuant, AWQ, and GPTQ all exist to stop one outlier channel from destroying quantization precision for the normal channels.]
A single outlier activation channel forces the per-tensor quantization scale to span the full outlier range, leaving typical channels with only a handful of distinct INT8 levels and destroying model quality.

Why does this matter for quantization? Because a per-tensor scale has to span the entire range. If the max is 100, the scale is set so that 100 maps to the largest int8 value (127). Then a typical value of 3 maps to about 3.8 — which gets rounded to 4. The precision in the typical range is destroyed by the outliers.
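The arithmetic in that paragraph can be checked directly. A small sketch counting how many INT8 levels the typical channels actually occupy under a per-tensor scale versus a per-channel one (numbers match the running example):

```python
def int8_levels_used(values, scale):
    """Count the distinct int8 levels the quantized values actually occupy."""
    return len({round(v / scale) for v in values})

typical = [i * 0.06 - 3.0 for i in range(101)]  # 101 values spread over [-3, 3]

per_tensor_scale = 100.0 / 127   # scale must cover the +100 outlier
per_channel_scale = 3.0 / 127    # the outlier lives in its own channel

levels_pt = int8_levels_used(typical, per_tensor_scale)   # ~9 levels survive
levels_pc = int8_levels_used(typical, per_channel_scale)  # all 101 values distinct
```

With the outlier-driven scale, 101 distinct float values collapse onto about nine integer levels (the {-4 .. 4} band); give the channel its own scale and every value keeps a distinct code. That collapse is the quality loss, made concrete.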

The various INT8 fixes (LLM.int8(), SmoothQuant) and the INT4 fixes (AWQ, GPTQ) are all different ways to handle this:

  • LLM.int8(): separate the outliers and quantize them in fp16, the rest in int8.
  • SmoothQuant: shift magnitude from activations to weights via per-channel scaling.
  • AWQ: identify the channels driven by high-magnitude activations and preserve their precision.
  • GPTQ: use second-order information to optimize scales globally, accounting for activation magnitudes.

These are all responses to the same underlying issue. Modern quantization research is largely about better and cheaper outlier handling.

The reason FP8 has been such a clean win is that FP8’s exponent gives it dynamic range that handles outliers naturally. The whole class of “outlier handling” tricks needed for INT8/INT4 is much less critical for FP8. This is one of the reasons the field is shifting toward FP8 on hardware that supports it.
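To see why the exponent helps, here is a simplified pure-Python rounder for e4m3 (clipping at the max normal 448 and ignoring NaN encodings; real hardware rounding may differ in tie-breaking). Both a typical value and an outlier are representable with small relative error, with no shared scale mediating between them:

```python
import math

def round_to_e4m3(x):
    """Round to the nearest FP8 e4m3 value (1 sign, 4 exponent bits with bias 7,
    3 mantissa bits). Simplified: clips to the max normal 448, ignores NaNs."""
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    a = min(abs(x), 448.0)
    e = max(math.floor(math.log2(a)), -6)  # -6 = smallest normal exponent
    step = 2.0 ** (e - 3)                  # 3 mantissa bits: 8 steps per binade
    return sign * min(round(a / step) * step, 448.0)

for v in (3.0, 100.0):
    q = round_to_e4m3(v)
    rel_err = abs(q - v) / v  # bounded by ~2^-4 anywhere in range
```

The grid spacing grows with magnitude, so the relative error is roughly constant across the range: 3.0 is represented exactly, and 100.0 lands within a few percent. An INT8 grid sharing one scale between those two values cannot do both, which is the whole outlier problem in one sentence.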

26.10 Picking a scheme

The decision tree:

graph TD
  HW{Hardware?}
  HW -->|H100/H200/B200| FP8[Try FP8 first]
  HW -->|A100| INT8[INT8 SmoothQuant or AWQ INT4]
  HW -->|V100/T4| V100INT8[INT8 SmoothQuant]
  HW -->|Consumer GPU| AWQ[AWQ INT4 + bitsandbytes]
  FP8 --> Mem{Memory tight?}
  Mem -->|No| BF16[bf16 — no quantization]
  Mem -->|Some pressure| FP8done[FP8 weights + FP8 KV]
  Mem -->|Very tight| INT4[AWQ INT4 + FP8 KV cache]

For most 2025 H100 deployments the optimal path lands on FP8 weights with FP8 KV cache — clean, fast, and essentially no quality loss.

Q: What hardware are you on?

  • H100/H200/B200: FP8 first. Fall back to AWQ INT4 if you need more memory.
  • A100: INT8 (SmoothQuant) or AWQ INT4.
  • V100/T4: INT8 (SmoothQuant). FP8 isn’t supported.
  • Consumer (4090, etc.): AWQ INT4 with bitsandbytes or specialized kernels (EXL2, GGUF).

Q: How much memory pressure?

  • Lots of memory headroom: bf16, no quantization.
  • Some pressure: INT8 (~2× smaller) or FP8.
  • Very tight: INT4 (~4× smaller).

Q: What quality budget do you have?

  • Essentially zero quality loss: bf16, FP8.
  • Small loss (< 1 point): INT8 SmoothQuant, FP8 with calibration.
  • Moderate loss (1–3 points): AWQ INT4, GPTQ INT4.
  • Larger loss acceptable: INT4 with smaller group sizes, more aggressive schemes.

Q: What workload?

  • Long context (32k+): KV cache quantization is worth it.
  • Short context: weight quantization is the priority.
  • Multi-tenant LoRA: quantize the base, keep adapters in bf16.
  • Reasoning models with very long generations: KV cache quantization is critical.

For the typical 2025 production deployment of a 70B model on H100s: AWQ INT4 weights + FP8 KV cache is a strong default. Throughput is high, the memory footprint is small, and quality is within 1–2 points of bf16. For a 70B on B200 with more memory headroom, FP8 weights + FP8 KV cache is cleaner.

26.11 The mental model

Eight points to take into Chapter 27:

  1. Quantization is the dominant inference compression technique. Memory, bandwidth, and compute all benefit.
  2. Per-channel and per-group quantization are needed for transformers because of activation outliers. Per-tensor is too coarse.
  3. PTQ is what you want for production. QAT only when PTQ isn’t enough.
  4. INT8 with SmoothQuant is the mid-precision standard.
  5. FP8 is the H100+ default. Cleanest, fastest, most quality-preserving.
  6. INT4 (AWQ or GPTQ) is the aggressive option. 4× smaller, 1–3 points of quality cost.
  7. KV cache quantization is a separate decision, important for long-context workloads.
  8. Activation outliers are the deep technical challenge that all these schemes are addressing.

In Chapter 27 we look at speculative decoding — the technique that lets you generate more than one token per forward pass.


Read it yourself

  • Dettmers et al., LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (2022).
  • Xiao et al., SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (2022).
  • Lin et al., AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (2023).
  • Frantar et al., GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (2022).
  • The NVIDIA FP8 inference whitepaper.
  • The vLLM documentation on quantization (vllm/docs/source/quantization/).
  • The bitsandbytes README on GitHub.

Practice

  1. Compute the memory of a 70B model in bf16, fp8, int8, and int4. Compute the minimum decode latency for each on an H100 (3 TB/s HBM).
  2. Why does per-channel quantization help INT8 transformers more than per-tensor? Construct a tiny example with one outlier channel.
  3. What’s the difference between INT8 with SmoothQuant and INT8 with LLM.int8()? Which is faster, and why?
  4. Why does FP8 handle activation outliers better than INT8? Construct a numerical example.
  5. AWQ uses per-channel scaling. Walk through the math: how does scaling weights up by s and activations down by s preserve the matmul output?
  6. KV cache quantization is independent of weight quantization. Argue why each is better suited for different workloads.
  7. Stretch: Take a small open model and quantize it with AWQ (use the autoawq library). Compare the quality on a small benchmark to the bf16 baseline. Measure the throughput gain on a single GPU.

Concept check

  1. A 70B model in INT4 has roughly 4× faster minimum decode latency than the same model in BF16. What is the direct mechanism for this speedup?
  2. AWQ (Activation-aware Weight Quantization) produces better quality at INT4 than naive round-to-nearest quantization. What is its key insight?
  3. SmoothQuant addresses the INT8 activation outlier problem by migrating quantization difficulty from activations to weights. Concretely, what does it do?
  4. KV cache quantization is applied separately from weight quantization. Why might INT4 KV cache quantization be riskier than INT4 weight quantization at the same bit width?