Part III · Inference Internals & Production Serving
Chapter 34 Deep dive ~22 min read

Mixture of Experts: routing, balancing, and the inference cost story

"A 671B parameter model with 37B active parameters sounds like cheating. It is. The catch is in the bandwidth, not the math"

In Chapter 33 we shrunk the KV cache via attention compression. In this chapter we shrink the per-token compute via Mixture of Experts (MoE): a model where each token only activates a small fraction of the parameters, while the total parameter count remains huge.

MoE is one of the most consequential architectural ideas of the past few years. Mixtral, DeepSeek-V3, GPT-4 (rumored), Claude (rumored), Gemini, and many others use it. Understanding MoE — what it gives you, what it costs operationally, and when to use it — is essential for senior ML systems work.

This chapter covers the architecture, the training story, the routing problem, the load balancing problem, and the inference cost picture that makes MoE simultaneously the most exciting and the most operationally annoying architecture choice.

Outline:

  1. The dense vs sparse trade.
  2. The basic MoE architecture.
  3. Routing — top-k gating.
  4. Load balancing.
  5. Auxiliary losses and the load balancing loss.
  6. The inference cost story — total params vs active params.
  7. Expert parallelism for serving.
  8. Why MoE serving is harder than dense serving.
  9. The current state — Mixtral, DeepSeek-V3, fine-grained MoE.

34.1 The dense vs sparse trade

A “dense” transformer applies every parameter to every token. A 70B-parameter dense model uses 70B parameters of compute on every input token. The model’s capacity (what it can learn and remember) is bounded by its parameter count, and so is its inference cost.

A “sparse” transformer applies only a subset of its parameters to each token. A 671B-parameter MoE model might use only 37B parameters of compute on each token (the rest are “experts” that aren’t activated). The model has a much higher total parameter count — and therefore much more total capacity — but the per-token compute cost is closer to the smaller “active” parameter count.

The trade:

  • Dense: simpler, predictable cost, smaller total capacity per dollar of compute.
  • Sparse: complex, variable cost, much larger total capacity per dollar of compute.
[Figure: dense FFN vs MoE FFN. The dense block applies all parameters (e.g., 70B) to every token; the MoE block routes each token through the top-2 of 4 experts (~13B of 47B active).]
MoE replaces the dense FFN with many parallel expert networks; only the top-k routed experts contribute to each token's output, making per-token compute proportional to active parameters rather than total parameters.

The question: does the extra capacity translate to better quality? Yes, empirically. A 671B MoE with 37B active parameters consistently outperforms a 37B dense model and is competitive with a 70B-100B dense model. The additional parameters store knowledge and skills that aren’t activated for every token but are activated when relevant.

The problem: MoE models are harder to train, harder to serve, and harder to reason about than dense models. The complexity is real and pushes some labs back to dense architectures despite the quality advantage.

34.2 The basic MoE architecture

In a standard transformer block, the FFN is a single feedforward network applied to every token. In an MoE transformer block, the FFN is replaced by multiple parallel FFNs (the “experts”), and a small “router” network decides which experts to activate for each token.

Schematically:

Standard FFN block:
    x → FFN(x) → output

MoE FFN block:
    x → router(x) → top_k expert indices and weights
    x → for each selected expert: FFN_i(x)
    x → weighted sum of expert outputs → output

Specifically, for an MoE layer with E experts and top-k routing where k = 2:

  1. The router is a small linear layer: W_router ∈ R^(d_model × E). It takes the token’s hidden state and produces a score for each expert: scores = x @ W_router.
  2. Take the top-k scores (typically k=2). Let’s call the top-k expert indices i, j and the top-k scores s_i, s_j.
  3. Apply softmax over just the top-k: g_i = exp(s_i) / (exp(s_i) + exp(s_j)), similarly g_j.
  4. Compute the output of expert i and expert j on the input: o_i = FFN_i(x), o_j = FFN_j(x).
  5. Combine: output = g_i × o_i + g_j × o_j.

The other E - 2 experts are not activated for this token. Their parameters don’t contribute to the forward pass. Each token only “uses” 2 out of E experts.
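
The five steps above can be sketched in a few lines of NumPy. This is an illustration of the routing math with toy dimensions and random stand-in weights, not a trained layer:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, E, k = 16, 64, 8, 2

W_router = rng.standard_normal((d_model, E)) * 0.02
# Each expert is a tiny 2-layer FFN: d_model -> d_ff -> d_model.
experts = [
    (rng.standard_normal((d_model, d_ff)) * 0.02,
     rng.standard_normal((d_ff, d_model)) * 0.02)
    for _ in range(E)
]

def moe_forward(x):
    scores = x @ W_router                    # step 1: one score per expert
    topk = np.argsort(scores)[-k:]           # step 2: indices of the top-k
    g = np.exp(scores[topk])
    g = g / g.sum()                          # step 3: softmax over just the top-k
    out = np.zeros_like(x)
    for gate, idx in zip(g, topk):           # steps 4-5: only k of E experts run
        W1, W2 = experts[idx]
        out += gate * (np.maximum(x @ W1, 0.0) @ W2)   # ReLU FFN, gated sum
    return out, topk

x = rng.standard_normal(d_model)
y, active = moe_forward(x)
print(y.shape, sorted(active.tolist()))      # (16,) and the 2 selected expert indices
```

The other E − k experts never execute, which is exactly where the compute savings come from.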

For Mixtral 8x7B: E = 8 experts, k = 2 (top-2 routing). The name suggests 8 × 7B = 56B, but attention, embeddings, and norms are shared across experts, so the total is 47B parameters; each token activates only ~13B (the shared parts plus 2 of the experts).

For DeepSeek-V3: E = 256 routed experts plus 1 shared expert that every token uses, k = 8 (top-8 over the routed experts). Total 671B parameters, ~37B active per token. Much more aggressive sparsity.

The compute per token scales with the active parameters (~37B), not the total (671B). The model behaves at inference like a 37B dense model (in compute) but contains the knowledge of a 671B model.

34.3 Routing — top-k gating

The router is a tiny module — just a single linear layer (d_model → E) plus the top-k selection logic. It’s almost free in compute.

The router is trained jointly with the rest of the model. The router learns which experts to activate based on the input. Empirically, different experts specialize in different things — one might handle code, one might handle math, one might handle conversational text. The specialization is learned, not designed.

The top-k selection is the critical operation. Several details matter:

[Figure: MoE routing flow — x → W_router (d_model → E) → scores → top-k select (k=2) + softmax → FFN_i(x) and FFN_j(x) → weighted combine g_i·o_i + g_j·o_j; the other E−2 experts are not activated for this token.]
Top-k routing runs the router linear layer to score all experts, selects the top-k indices, applies softmax over those scores to get gating weights, and combines their outputs — all other experts are skipped entirely.

(1) Differentiability. Top-k selection is a discrete operation with no gradient. In practice this is handled by routing gradients through the gate values: the softmax weights g_i of the selected experts are differentiable, so the router still learns how much to trust each chosen expert, while the non-selected experts simply receive zero gradient. There are more sophisticated approaches (Gumbel-softmax, straight-through estimators), but plain top-k softmax gating is what most production MoE models use.

(2) Load balance. Without intervention, the router might learn to send most tokens to the same few experts, leaving the others unused. This is bad for both quality (the unused experts don’t learn) and efficiency (the active experts are overloaded). The fix is the load balancing loss (next section).

(3) Token vs sequence routing. Top-k routing is per-token: each token independently picks its experts. Some research has explored sequence-level routing (one expert choice per sequence) but it hasn’t won. Per-token is the standard.

34.4 Load balancing

The deepest operational challenge in MoE training is load balancing. If the router sends 80% of tokens to expert 0, then expert 0 is overworked while experts 1-7 are idle. Several bad things happen:

  • Compute imbalance. In an expert-parallel deployment (Chapter 28), expert 0’s GPU is bottlenecked while the others sit idle. The whole system runs at the speed of the slowest GPU.
  • Capacity loss. Underused experts don’t get gradient updates and don’t learn. They become dead weight.
  • Quality regression. A model with effectively 1 expert is just a smaller dense model. The total parameter count is wasted.

The fix is to explicitly encourage the router to spread tokens across experts. There are several techniques:

Load balancing loss (auxiliary loss)

Add a term to the training loss that penalizes uneven routing. The standard form:

L_balance = E × Σ_e (f_e × P_e)

where f_e is the fraction of tokens in the batch dispatched to expert e, and P_e is the mean router probability assigned to expert e. This is the Switch Transformer loss (Fedus et al., 2021). It is minimized (at value 1) when routing is uniform across experts. It’s added to the main loss with a small coefficient (typically around 0.01) so it nudges the router toward balance without dominating.
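
The balancing loss can be sketched numerically (top-1 assignments for simplicity). The two extremes below show that the loss sits at 1.0 under uniform routing and grows as the router collapses onto one expert:

```python
import numpy as np

def load_balance_loss(probs, assign, E):
    # f_e: fraction of tokens dispatched to expert e
    f = np.bincount(assign, minlength=E) / len(assign)
    # P_e: mean router probability assigned to expert e
    P = probs.mean(axis=0)
    return E * np.sum(f * P)

rng = np.random.default_rng(0)
tokens, E = 1024, 8

uniform_probs = np.full((tokens, E), 1 / E)
uniform_assign = rng.integers(0, E, tokens)

collapsed_probs = np.zeros((tokens, E))
collapsed_probs[:, 0] = 1.0                  # router always picks expert 0
collapsed_assign = np.zeros(tokens, dtype=int)

print(load_balance_loss(uniform_probs, uniform_assign, E))      # 1.0 (the minimum)
print(load_balance_loss(collapsed_probs, collapsed_assign, E))  # 8.0 (fully collapsed)
```

Because both factors (token fraction and probability mass) spike together when routing collapses, the product penalizes collapse quadratically.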

Capacity factor and token dropping

Each expert has a capacity — a maximum number of tokens it can process per batch. If too many tokens are routed to one expert, the excess tokens are dropped (their expert output is set to zero, or routed to a fallback expert).

The capacity factor is set during training. A capacity factor of 1.0 means each expert handles exactly tokens / num_experts tokens. A capacity factor of 1.5 gives some headroom for slight imbalance. Lower capacity factors are more efficient but more aggressive about dropping; higher factors are more forgiving but less efficient.
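
A minimal sketch of capacity-based token dropping, assuming top-1 assignments and first-come, first-served overflow handling (real implementations vectorize this and may use other tie-breaking rules):

```python
import numpy as np

def apply_capacity(assign, E, capacity_factor):
    tokens = len(assign)
    capacity = int(capacity_factor * tokens / E)   # max tokens per expert
    kept = np.zeros(tokens, dtype=bool)
    count = np.zeros(E, dtype=int)
    for t, e in enumerate(assign):                 # first-come, first-served
        if count[e] < capacity:
            kept[t] = True
            count[e] += 1
    return kept          # dropped tokens get a zero expert output

rng = np.random.default_rng(0)
E, tokens = 8, 1024
# Expert 0 is "popular": it attracts ~30% of tokens instead of 12.5%.
assign = rng.choice(E, size=tokens, p=[0.30] + [0.10] * 7)

kept = apply_capacity(assign, E, capacity_factor=1.25)
# capacity = 160 per expert, so most of expert 0's overflow is dropped
print(tokens - kept.sum(), "tokens dropped")
```

Raising the capacity factor shrinks the dropped count to zero at the cost of more padded, wasted compute per expert.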

Expert pruning during training

Some training recipes monitor for “dead” experts (consistently receiving less than a threshold fraction of tokens) and re-initialize them. This is rare in modern training because the load balancing loss usually keeps things balanced.

Fine-grained expert routing (DeepSeek-V3)

DeepSeek-V3 uses a more sophisticated routing scheme. Instead of selecting top-k from a flat pool of experts, it routes to top-k from each of several “groups” of experts. This gives more deterministic load balancing — each group is guaranteed to have some tokens — at the cost of routing flexibility.
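
The group idea can be sketched as follows — this is an illustration of group-limited routing as described above (pick the top k/G experts within each group so every group is guaranteed traffic), not DeepSeek-V3's exact algorithm:

```python
import numpy as np

def grouped_topk(scores, G, k):
    # Split E experts into G equal groups; take the top (k // G) per group.
    E = len(scores)
    per_group = k // G
    groups = scores.reshape(G, E // G)          # (G, experts per group)
    chosen = []
    for g in range(G):
        local = np.argsort(groups[g])[-per_group:]
        chosen.extend(g * (E // G) + local)     # map back to global indices
    return np.array(chosen)

rng = np.random.default_rng(0)
scores = rng.standard_normal(256)
idx = grouped_topk(scores, G=8, k=8)            # one expert from each of 8 groups
print(sorted((idx // 32).tolist()))             # every group 0..7 appears exactly once
```

Compared with a flat top-8, this sacrifices the freedom to put all 8 picks in one group, but it bounds the worst-case imbalance across groups by construction.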

DeepSeek-V3 also uses an auxiliary-loss-free load balancing strategy where the router’s biases are dynamically adjusted during training to maintain balance, avoiding the need for an explicit load balancing loss. This is a recent innovation and is part of why DeepSeek-V3 is so strong.

34.5 The inference cost story

This is where MoE gets interesting and confusing. The cost picture for serving an MoE model has several axes.

Compute per token

Compute per token is determined by active parameters, not total. For DeepSeek-V3 with 37B active params, the FLOPs per token are similar to a 37B dense model. So per-token compute is much lower than the 671B total parameter count would suggest.

This is the headline win of MoE: compute per token is small compared to total capacity.

Memory for parameters

Memory is determined by total parameters, not active. For DeepSeek-V3, you need 671B parameters of GPU memory to load the model — about 1.3 TB in bf16. This is much more than a 37B dense model.

So MoE shifts the cost from compute to memory. For training, this is a wash (you need both anyway). For inference, this is worse: serving a 671B MoE requires holding all 671B parameters, even though only 37B are activated per token.
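
The memory numbers above are simple arithmetic — parameter count × bytes per parameter, ignoring KV cache and activations:

```python
# Back-of-envelope parameter memory in bf16 (2 bytes per parameter).
def param_memory_tb(params, bytes_per_param=2):
    return params * bytes_per_param / 1e12

print(param_memory_tb(671e9))   # ~1.34 TB for DeepSeek-V3 (total params)
print(param_memory_tb(37e9))    # ~0.07 TB for a 37B dense model
```

The ~18× memory gap is what forces the multi-GPU, multi-node deployments discussed below.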

HBM bandwidth — the catch

This is the most subtle part. Recall from Chapter 21 that decode is HBM-bandwidth-bound: the per-token latency is roughly model_size / HBM_bandwidth.

For an MoE model, you might naively assume the bandwidth requirement is active_params × bytes, since only those parameters participate in the forward pass. For a single token that’s roughly right: the step reads the shared weights plus the k selected experts — ~37B parameters for DeepSeek-V3. The catch shows up with batching. In a dense model, one weight read from HBM serves every token in the batch. In an MoE model, different tokens select different experts, so as the batch grows the step must read more and more distinct experts’ weights — in the limit, nearly all of them — while each expert’s read is amortized over only the fraction of the batch routed to it.

Still, isn’t that much better than reading the full 671B every step? It can be — if the active experts are resident on the local GPU and the active subset is read efficiently. In practice, expert parallelism spreads experts across GPUs, and the per-GPU bandwidth picture depends on which experts are local and how the batch routes.

The bandwidth picture is complex:

  • Best case (active experts are on the local GPU): bandwidth scales with active params, ~37B per token.
  • Worst case (active experts are on remote GPUs): you have to fetch them over the interconnect, which is slower than HBM.
  • Real case: somewhere in between, depending on the routing pattern and the EP layout.

This is the source of the “MoE is hard to serve” reputation. The optimization isn’t just about FLOPs or memory — it’s about routing patterns and the network topology.
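
A quick calculation makes the batching effect concrete. Under uniform random top-k routing (a simplifying assumption — learned routing is less uniform), the expected number of distinct experts a batch of B tokens touches is E × (1 − (1 − k/E)^B), and every distinct expert touched is another set of weights the step must read:

```python
# Expected distinct experts touched per step under uniform random top-k routing.
def expected_active_experts(E, k, batch):
    return E * (1 - (1 - k / E) ** batch)

E, k = 256, 8   # DeepSeek-V3-like sparsity
for B in (1, 8, 64, 512):
    print(B, round(expected_active_experts(E, k, B), 1))
# At B=1 only k=8 experts are read; by B=512 nearly all 256 are.
```

So at batch size 1 the bandwidth cost really does track active params, but at serving-scale batches the weight traffic approaches the total parameter count, read once per step and amortized over the whole batch.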

The summary

MoE shifts cost from compute to memory: per-token FLOPs scale with active params while total memory holds all params; bandwidth depends on the active expert shard.

Dimension           Dense 70B                 MoE 671B (37B active)
Compute / token     ~70B params × FLOPs       ~37B params × FLOPs ✓
Total GPU memory    140 GB (bf16)             1.34 TB ← the catch
HBM BW / token      70B × 2 bytes             active shard BW (varies)
Network overhead    all-reduce (TP)           all-to-all (EP) ← expensive
MoE shifts the cost from per-token FLOPs (which scale with active parameters) to total memory and all-to-all network overhead — making it cheaper to run but harder to serve at scale.

For MoE serving:

  • Compute per token is roughly that of a dense model with active_params parameters.
  • Total memory is much larger (the full total_params).
  • Bandwidth per token is somewhere between active and total, depending on EP and routing.
  • Network bandwidth between experts matters in multi-GPU EP setups.

The net: MoE gives you the quality of a much bigger dense model at the compute of a smaller dense model, but at the memory cost of the full model and with operational complexity from routing and EP.

When the trade is favorable — high-quality requirements, plenty of memory, fast interconnect — MoE wins. When memory is tight or interconnect is slow, dense models can be more practical.

34.6 Expert parallelism for serving

We covered this briefly in Chapter 28. Expert parallelism (EP) is the natural way to serve a large MoE model across multiple GPUs.

The setup:

  • Each GPU holds a subset of the experts (e.g., GPU 0 has experts 0-7, GPU 1 has experts 8-15, etc.).
  • The non-MoE parts (attention, embeddings, layer norms) are replicated or split via TP.
  • For each token, the router decides which experts to activate. The token is sent to the GPUs holding those experts.
  • The expert outputs are sent back to the originating GPU and combined.
[Figure: expert parallelism — GPU 0 holds experts 0–63, GPU 1 holds experts 64–127, …, GPU N; the attention block is replicated on each GPU. An all-to-all dispatches each token to the GPUs owning its chosen experts and gathers results back. All-to-all is the dominant communication cost in EP (vs all-reduce in TP).]
Expert parallelism assigns each GPU a disjoint set of experts; each forward step requires an all-to-all to route tokens to the right GPU and gather results back, making fast interconnect essential.

The communication pattern is all-to-all: every GPU has tokens for every other GPU’s experts, and they all exchange in one collective operation.

The all-to-all is the dominant communication cost for MoE serving. It’s more complex than the all-reduce of TP. It demands fast interconnect (NVLink within a node, RDMA across nodes). For DeepSeek-V3, the recommended deployment is a multi-node EP setup with InfiniBand interconnect.
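
A toy simulation of one dispatch step makes the all-to-all traffic concrete. It assumes 256 experts sharded evenly over 32 GPUs and uniform random top-8 routing; real routing is learned and far less uniform, and that skew is exactly the inference-time load-imbalance problem described below:

```python
import numpy as np

rng = np.random.default_rng(0)
E, k, gpus, batch = 256, 8, 32, 4096
experts_per_gpu = E // gpus                    # 8 experts per GPU

# Each token picks k distinct experts uniformly at random.
choices = np.array([rng.choice(E, size=k, replace=False) for _ in range(batch)])
dest_gpu = choices // experts_per_gpu          # which GPU owns each chosen expert
load = np.bincount(dest_gpu.ravel(), minlength=gpus)

# Each GPU receives ~batch * k / gpus = 1024 token deliveries this step.
print(load.min(), load.max())
```

The gap between `load.min()` and `load.max()` is small here only because the routing is uniform; with a skewed learned router, the max-loaded GPU sets the step time for everyone.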

The latency picture:

  • All-to-all has higher fixed latency than all-reduce (more network round-trips).
  • But the per-GPU compute is much lower (each GPU only handles its own experts), so the wall-clock can still be fast.
  • The trade-off depends on the network speed and the model size.

For most production MoE deployments, the configuration is something like TP=4 within a node + EP across nodes. The TP handles the non-expert parts; the EP handles the experts.

34.7 Why MoE serving is harder

Concretely, the operational pain points of MoE serving:

(1) Larger memory footprint. A 671B MoE needs 1.3 TB of HBM. That’s many GPUs, period.

(2) All-to-all communication. Every step has an all-to-all. This is more complex and more expensive than all-reduce.

(3) Load imbalance at inference. Even with training-time balancing, real inference workloads can hit “popular” experts disproportionately. One expert’s GPU gets overloaded; the others sit idle.

(4) Routing variance. Different tokens go to different experts. The compute pattern is irregular, which makes scheduling and batching harder.

(5) Capacity-factor decisions. The serving stack has to decide what to do when an expert is overloaded. Drop tokens? Spill to a less-loaded expert? Both have quality and operational implications.

(6) Less mature kernel support. Until recently, MoE kernels were less optimized than dense kernels. The gap is closing, but vLLM’s MoE support is newer and less battle-tested than its dense support.

(7) Quantization is harder. Quantizing MoE models is non-trivial because the experts are individually small (so quantization noise has more impact per expert) and because the routing distribution can shift under quantization.

These are all solvable, and the open-source serving stacks (vLLM, SGLang, TensorRT-LLM) have improved their MoE support significantly. But they’re real, and they’re why some labs prefer dense models even at higher per-token compute cost.

34.8 The current state

A summary of MoE adoption as of late 2025:

Model                Year   Total params   Active params   Experts                       Top-k
GShard               2020   varies         varies          up to 2048                    2
Switch Transformer   2021   1.6T           ~7B             2048                          1
GLaM                 2021   1.2T           ~96B            64                            2
Mixtral 8x7B         2023   47B            ~13B            8                             2
Mixtral 8x22B        2024   141B           ~39B            8                             2
DeepSeek-V2          2024   236B           21B             162 (160 routed + 2 shared)   6
DeepSeek-V3          2024   671B           37B             257 (256 routed + 1 shared)   8
Qwen 2.5-MoE         2024   varies         varies          varies                        2-4
GPT-4 (rumored)      2023   ~1.8T          ?               16?                           2?
Gemini (rumored)     2024   ?              ?               ?                             ?

The pattern: fine-grained MoE with many small experts (the DeepSeek approach) is winning over coarse-grained MoE with few large experts (the Mixtral approach). The reason: more experts means more flexibility, less interference between routing decisions, and better total parameter utilization.

DeepSeek-V3’s design — 256 experts with top-8 routing — is the current state of the art for open MoE models. It’s also the most operationally complex. Serving it well requires expert parallelism across multiple nodes with fast interconnect.

Expect more models to adopt fine-grained MoE in 2025-26. The training techniques are maturing, the serving stacks are improving, and the quality advantages are real.

34.9 The mental model

Eight points to take into Chapter 35:

  1. MoE replaces a single FFN with many parallel “experts” and a router that activates the top-k for each token.
  2. Each token activates a small fraction of total parameters. Compute per token is roughly that of a dense model with active_params parameters.
  3. Routing is per-token via top-k softmax, with auxiliary loss for load balancing.
  4. Total memory is the full model size, not the active size. MoE shifts cost from compute to memory.
  5. Expert parallelism spreads experts across GPUs. The cost is all-to-all communication every step.
  6. MoE serving is harder than dense serving because of memory, all-to-all communication, load imbalance, and routing variance.
  7. Fine-grained MoE (many small experts, e.g., DeepSeek-V3 with 256) is winning over coarse-grained.
  8. The quality-per-active-param ratio is excellent for MoE, but the operational cost is real. Use MoE when quality and memory headroom matter more than operational simplicity.

In Chapter 35 we look at the other major architectural concern: long context.


Read it yourself

  • Shazeer et al., Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (2017). The foundational MoE paper for transformers.
  • Fedus et al., Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (2021). Single-expert routing.
  • Du et al., GLaM: Efficient Scaling of Language Models with Mixture-of-Experts (2021).
  • The Mixtral 8x7B paper / blog post (Mistral AI, 2023).
  • The DeepSeek-V2 and DeepSeek-V3 technical reports. Read the routing and load balancing sections in particular.
  • The vLLM documentation on MoE serving.

Practice

  1. For Mixtral 8x7B with top-2 routing and ~13B active params, compute the per-token decode latency on H100 using the §22 formulas. Compare to a dense 13B model.
  2. Why does fine-grained MoE (256 experts) work better than coarse-grained (8 experts) at the same total parameter count? Argue at the level of expert specialization.
  3. The all-to-all communication in expert parallelism scales as O(K × E) per token. Why? Trace the data flow.
  4. For DeepSeek-V3 with 256 experts and top-8 routing, what’s the chance that a token’s 8 experts are all on the same GPU under uniform random routing if the experts are distributed across 32 GPUs?
  5. Why is quantization harder for MoE models than for dense models? Identify the failure modes.
  6. Read the DeepSeek-V3 paper’s “Auxiliary-Loss-Free Load Balancing” section. Explain the technique and why it’s an improvement over the auxiliary loss.
  7. Stretch: Run a small open MoE model (Qwen 2.5-MoE) with vLLM and measure its decode throughput vs a same-active-param dense model. Verify the per-token compute is similar.

Concept check

  1. A MoE model has 671B total parameters with top-2 routing from 128 experts, giving 37B active parameters per token. Why is the memory bandwidth cost per token higher than a 37B dense model despite identical active compute?
  2. What is load imbalance in MoE routing and why does it degrade throughput?
  3. Why does a MoE model generally require expert parallelism (EP) in addition to tensor parallelism for large-scale inference?
  4. DeepSeek-V3 uses fine-grained MoE with 256 experts and top-8 routing rather than coarse MoE with 8 experts top-2. What advantage does fine-grained routing provide, and what challenge does it introduce?