Part III · Inference Internals & Production Serving
Chapter 41 ~17 min read

State-space models: Mamba, Mamba-2, hybrid architectures

"Maybe attention isn't all you need"

For nearly a decade, the answer to “what sequence model should I use?” has been “the transformer.” Attention was the operation, and everything else was an afterthought. Around 2023, a different lineage of architectures started showing competitive results: state-space models (SSMs), and specifically Mamba (Gu & Dao, 2023). These models use a different mechanism for handling sequence dependence — one based on linear-time recurrent updates rather than quadratic-time attention.

This chapter is the deep tour of state-space models for an LLM systems audience. You won’t write SSMs in production unless you’re at a frontier lab, but you should know the architecture, the trade-offs, and why hybrid attention-SSM models are becoming a real thing in 2024-25.

Outline:

  1. The motivation: linear-time sequence modeling.
  2. State-space models: the math.
  3. The selectivity problem and how Mamba solves it.
  4. The Mamba architecture in detail.
  5. Mamba-2 and the duality with attention.
  6. Inference cost: why SSMs change the picture.
  7. Hybrid attention-SSM models.
  8. Quality vs transformers.
  9. The state of the art in 2025.

41.1 The motivation: linear-time sequence modeling

Recall the cost of attention from Chapter 6: O(s²) in both compute and memory. Every attention operation pays this quadratic price. For long sequences, the cost is prohibitive — even with all the optimizations from Stage 2 (FlashAttention, KV cache, GQA), attention’s fundamental scaling is still quadratic.

The dream: a sequence operator that’s linear in length. Compute scales as O(s × d²), memory scales as O(d²) (constant in s). The quadratic-to-linear compute ratio is roughly s/d, so for a 1M-token sequence and a hidden dimension on the order of 1,000, linear-time is roughly 1000× cheaper.
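A back-of-envelope check of that ratio (the hidden size of 1,000 is illustrative — the crossover point depends on d):

```python
# Rough FLOP comparison, not a benchmark:
# attention layer ~ O(s^2 * d), linear-time operator ~ O(s * d^2).
s = 1_000_000   # sequence length: 1M tokens
d = 1_000       # hidden dimension (illustrative)

attn_flops = s**2 * d      # quadratic in sequence length
linear_flops = s * d**2    # linear in sequence length

print(attn_flops // linear_flops)  # ratio = s / d = 1000
```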

There have been many attempts at linear sequence operators:

  • Recurrent neural networks (LSTMs, GRUs) — linear in compute but sequential in time, so they can’t be parallelized during training. They were displaced by transformers because the parallelization was so much more important than the asymptotic cost.
  • Linear attention — approximations of attention that compute it in linear time. These work in toy settings but don’t match full attention quality on language modeling.
  • Sparse attention (Chapter 35) — attention with structured sparsity. Subquadratic but only on certain patterns, and quality suffers.
  • State-space models — the modern answer. Linear in compute, parallelizable in training, and (after Mamba’s selectivity trick) competitive with attention on language modeling.

State-space models are the most promising linear-time alternative to attention as of 2025. They’re not yet replacing transformers in production, but they’re real and they’re improving.

41.2 State-space models: the math

The classical state-space model in continuous time is a linear system:

dx/dt = A x(t) + B u(t)        # state update
y(t)  = C x(t) + D u(t)        # output

Where x(t) is a hidden state vector, u(t) is the input, y(t) is the output, and A, B, C, D are matrices. This is the standard linear control theory model — used for everything from electrical circuits to flight controllers.

For discrete sequences (like language tokens), you discretize to get:

x_t = A_bar × x_{t-1} + B_bar × u_t       # state update
y_t = C × x_t                              # output

A_bar, B_bar are discretized versions of A, B; the D × u_t feedthrough term is usually implemented as a simple skip connection and omitted from the recurrence. The recurrence is linear and Markovian — x_t depends only on x_{t-1} and u_t, not on the full history.
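To make the discretization concrete, here is a minimal numpy sketch for a diagonal A (the structure the S4/Mamba line of work uses), with zero-order-hold discretization. All names and shapes are illustrative, not the reference implementation:

```python
import numpy as np

def discretize(A, B, delta):
    """Zero-order-hold discretization for a diagonal A.
    A: (d_state,) diagonal entries (negative for stability)
    B: (d_state,) input projection
    delta: scalar step size"""
    A_bar = np.exp(delta * A)        # element-wise exp = matrix exp for a diagonal
    B_bar = (A_bar - 1.0) / A * B    # exact ZOH formula in the diagonal case
    return A_bar, B_bar

def ssm_scan(A_bar, B_bar, C, u):
    """Sequential reference: x_t = A_bar * x_{t-1} + B_bar * u_t, y_t = C . x_t."""
    x = np.zeros_like(A_bar)
    ys = []
    for u_t in u:
        x = A_bar * x + B_bar * u_t
        ys.append(C @ x)
    return np.array(ys)

A = -np.ones(4)                    # stable dynamics: eigenvalues at -1
B = np.ones(4)
C = np.full(4, 0.25)
A_bar, B_bar = discretize(A, B, delta=0.1)
y = ssm_scan(A_bar, B_bar, C, np.ones(8))  # constant input: y rises toward 1.0
```

For a constant step input, the output converges toward the system’s DC gain (1.0 with these numbers), which is a quick sanity check that the discretization is right.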

The naive view: this is just an RNN with a specific structure. So why is it interesting?

Figure: the SSM recurrence unrolled over time — states x_{t−1}, x_t, x_{t+1} chained left to right, each updated from the previous state and the input u_t, each emitting y_t = C × x_t from the current state only, never from the full history.
The SSM recurrence is linear in x — that is the key property that enables parallel-scan training while keeping inference a simple O(d²) state update per step with no growing KV cache.

The answer: the parallelization trick. A linear recurrence of this form can be parallelized using a parallel scan algorithm. Instead of computing x_1, x_2, x_3, ... sequentially, you can compute all of them in O(log s) parallel steps using prefix-scan operations. This gives you the parallelism advantage of attention (training-time parallelization) with the linear-time advantage of RNNs (inference-time efficiency).

The key insight: linear recurrences are parallelizable, nonlinear ones are not. By keeping the state update linear, SSMs get the best of both worlds.
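To see why linearity buys parallelism: each step x_t = a_t · x_{t−1} + b_t is an affine map, and composing affine maps is associative, so prefix states can be combined in any bracketing — a tree reduction gives O(log s) depth. A tiny sketch (illustrative names; the left fold below is sequential, but the same `combine` operator works in a Blelloch-style tree scan):

```python
import numpy as np

def combine(f, g):
    """Compose two affine updates x -> a*x + b. Associative, which is
    exactly what a parallel prefix scan requires."""
    a1, b1 = f
    a2, b2 = g
    return (a2 * a1, a2 * b1 + b2)

def sequential(a, b):
    """Plain recurrence: x_t = a_t * x_{t-1} + b_t, starting from x_0 = 0."""
    x = 0.0
    out = []
    for a_t, b_t in zip(a, b):
        x = a_t * x + b_t
        out.append(x)
    return out

def scan_with_combine(a, b):
    """Left fold with `combine` and the identity element (1, 0).
    Because `combine` is associative, a real kernel can evaluate this
    as a balanced tree in O(log s) parallel steps."""
    out = []
    acc = (1.0, 0.0)
    for a_t, b_t in zip(a, b):
        acc = combine(acc, (a_t, b_t))
        out.append(acc[1])
    return out

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.0, size=16)
b = rng.normal(size=16)
assert np.allclose(sequential(a, b), scan_with_combine(a, b))
```

A nonlinear recurrence (say, x_t = tanh(a_t·x_{t−1} + b_t)) has no such associative composition, which is why classic RNNs cannot be scanned this way.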

41.3 The selectivity problem

The classical SSMs — S4 (Gu and collaborators at Stanford, 2021) and successors such as S5 — had a problem: they used input-independent state updates. The matrices A_bar, B_bar were the same for every token. This made them efficient (you could pre-compute the transition kernels) but limited their expressiveness.

Specifically, classical SSMs were bad at content-based reasoning: tasks where the model needs to do something different based on what the input token actually is. Attention is naturally good at this because the attention scores depend on the content; classical SSMs used the same dynamics regardless of content.

Mamba’s contribution: make the state update input-dependent. The matrices A_bar, B_bar, C are now functions of the input u_t:

A_bar(u_t), B_bar(u_t), C(u_t)

This is the selective state space model of Mamba (S6 in the naming convention). The model can selectively “remember” or “forget” based on what the current input is. A token that says “the answer is 42” can update the state to remember 42; a filler token can leave the state unchanged.
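A minimal sequential reference for the selective recurrence, assuming hypothetical projection weights and a single scalar step size (real Mamba uses a per-channel Δ and fuses all of this into one CUDA kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_state, seq = 8, 4, 16

# Hypothetical projection weights — a real Mamba block learns these.
W_delta = rng.normal(size=(d_model,)) * 0.1
W_B = rng.normal(size=(d_model, d_state)) * 0.1
W_C = rng.normal(size=(d_model, d_state)) * 0.1
A = -np.exp(rng.normal(size=(d_state,)))   # negative diagonal dynamics

def selective_scan(u):
    """Sequential reference for the selective (S6) recurrence:
    step size, B, and C are recomputed from every input token."""
    x = np.zeros((d_model, d_state))            # per-channel state
    ys = []
    for u_t in u:                               # u_t: (d_model,)
        delta = np.log1p(np.exp(u_t @ W_delta)) # softplus keeps delta > 0
        B_t = u_t @ W_B                         # input-dependent B: (d_state,)
        C_t = u_t @ W_C                         # input-dependent C: (d_state,)
        A_bar = np.exp(delta * A)               # per-token discretization
        x = A_bar * x + delta * np.outer(u_t, B_t)  # selective state update
        ys.append(x @ C_t)                      # readout: (d_model,)
    return np.array(ys)

y = selective_scan(rng.normal(size=(seq, d_model)))  # (seq, d_model)
```

The selectivity lives in the loop body: a token producing a small delta leaves the state nearly unchanged (A_bar ≈ 1, small write), while a token producing a large delta overwrites the state aggressively.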

The cost is that the parallel scan is harder — you can’t pre-compute the kernels — but the Mamba paper showed it’s still feasible with a careful implementation. The kernels are written in CUDA and use the GPU’s parallel scan primitives.

The result: Mamba achieves transformer-quality language modeling at linear time complexity. This was the big news of 2023.

41.4 The Mamba architecture in detail

A Mamba block is a different shape than a transformer block. Instead of attention + FFN, it has:

input x (N, S, D)
    |
    v
LayerNorm
    |
    v
Linear projections to (Δ, B, C) -- the input-dependent SSM parameters
    |
    v
Selective SSM scan (linear time, parallel)
    |
    v
Gating with another linear projection
    |
    v
Linear output projection
    |
    v
+ residual
    |
    v
output (N, S, D)

The block has approximately the same parameter count as a transformer block of the same hidden dim. The compute cost is linear in sequence length instead of quadratic. The “memory” of the model is the SSM hidden state — a fixed-size tensor of roughly (N, d_model, D_state) per layer — not a per-position KV cache.

A Mamba model is built by stacking many of these blocks, just like a transformer. There’s no attention anywhere in pure Mamba.

The key implementation challenge is the selective scan kernel. Naive implementation is slow because the input-dependent transition matrices break the standard parallel-scan optimization. The Mamba paper provides a custom CUDA kernel that’s roughly competitive with FlashAttention in throughput at typical sequence lengths.

41.5 Mamba-2 and the duality

Mamba-2 (Dao & Gu, 2024) was a major refinement. The key result: SSMs and attention are mathematically related in a specific way. Specifically, you can view a particular form of attention as a special case of an SSM, and vice versa. The paper calls this “structured state-space duality” (SSD).

The practical implication: Mamba-2 uses a slightly different formulation that allows it to be implemented with the same hardware-optimized kernels as attention (specifically, the matrix multiplications that Tensor Cores are optimized for). This makes Mamba-2 much faster in practice than the original Mamba, which used custom scan kernels.

Mamba-2’s parameter count is similar to Mamba’s, but the compute is more efficient on modern GPUs. It’s also easier to scale because the kernel infrastructure already exists for the matmul-style operations.

The duality also gives a cleaner theoretical understanding. The claim “attention is just a special case of SSMs” isn’t quite right — attention has quadratic complexity, SSMs have linear — but the relationship is close enough that intuitions transfer between the two.

Mamba-2 is the SSM you’d actually use in 2024-25 if you wanted to use SSMs at all. The original Mamba is mostly historical at this point.

41.6 Inference cost — why SSMs change the picture

The inference picture for SSMs is dramatically different from transformers:

Figure: transformer decode vs SSM decode. At t=1 the transformer reads one KV pair; at t=8 it reads all eight — the KV cache is O(s) and grows with every generated token. A Mamba layer keeps a fixed-size state (on the order of tens of KB) that is overwritten each step — O(1) memory at any context length.
At inference, a transformer must read a KV cache that grows linearly with the tokens generated; a Mamba SSM maintains a constant fixed-size state per layer and overwrites it each step, making long-context serving dramatically cheaper in memory.

Memory. A Mamba model has a fixed-size state per layer — roughly d_model × D_state elements per sequence, where D_state is typically 16 or 64. The state size is independent of the sequence length. There is no KV cache. Or rather, the “cache” is the small fixed state, which at long contexts is orders of magnitude smaller than a transformer’s KV cache.

For Mamba 2.8B (a typical size) with D_state = 16, d_model ≈ 2560, and 64 layers, the per-sequence state is 2560 × 16 × 2 bytes (fp16) × 64 layers = 5,242,880 bytes ≈ 5 MB. Compare that to a transformer of similar scale, whose KV cache for a 32k-token context runs to thousands of MB.
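The same arithmetic in code. The Mamba shapes are the ones quoted above; the transformer KV shapes (80 layers, 8 KV heads, head dim 128) are an assumed Llama-3-70B-style GQA config, included only for contrast:

```python
# Mamba 2.8B-class state size: fixed per sequence, regardless of context length.
d_model, d_state, n_layers = 2560, 16, 64
fp16_bytes = 2

mamba_state = d_model * d_state * fp16_bytes * n_layers
print(mamba_state)             # 5,242,880 bytes ~= 5 MB

# Per-token KV cache for assumed Llama-3-70B-like GQA shapes.
kv_layers, kv_heads, head_dim = 80, 8, 128
kv_per_token = kv_layers * kv_heads * head_dim * 2 * fp16_bytes  # K and V
print(kv_per_token)            # 327,680 bytes = 320 KB per token
print(kv_per_token * 131_072)  # ~40 GB at a 128k-token context
```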

This is a huge advantage: memory pressure is no longer the bottleneck for SSMs. You can serve many more concurrent users on the same hardware. Long contexts are essentially free in memory.

Compute. Both prefill and decode are linear in sequence length. Prefill is O(s × d²) instead of O(s² × d). Decode is constant time per step (since the state is fixed-size, the recurrence step is O(d²) regardless of how long the past is).

For long contexts (32k+ tokens), SSMs are much faster than transformers — both in prefill (because of the linear scaling) and in decode (because each step is constant-time).

Latency. The sequential decode step is comparable in latency to a transformer decode step on a per-step basis. So per-token latency is similar; the savings come from sequence length scaling.

Bandwidth. SSMs are still memory-bandwidth-bound for decode (you have to read the model weights every step), but the per-token bandwidth requirement is similar to a transformer of the same parameter count. The savings are in the cache, not in the per-step bandwidth.

The inference cost picture for a long-context workload (say, 128k tokens):

Metric                         Transformer (attention)           Mamba (SSM)
Per-token KV/state size        320 KB (Llama 3 70B)              ~5 MB total (Mamba 2.8B — fixed, not per-token)
Total state for 128k context   40 GB                             ~5 MB (constant regardless of length)
Prefill compute (per token)    linear in current position        constant
Decode compute (per token)     linear in past length (KV scan)   constant

For long contexts, SSMs are dramatically more efficient. This is the strongest argument for SSMs in production: long-context inference at low cost.

41.7 Hybrid attention-SSM models

The catch: pure SSMs are competitive but not strictly better than transformers on language modeling quality. The empirical results show that pure Mamba models match transformers on perplexity and most benchmarks but lag on tasks that require fine-grained content-based reasoning — the kind of thing attention excels at.

The solution: hybrid models that combine attention and SSM blocks. A typical hybrid architecture has, e.g., 10-20% attention layers and 80-90% SSM layers, distributed throughout the network. The attention layers handle the content-based reasoning; the SSM layers handle the bulk of the sequence processing.
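A sketch of what such a layer schedule looks like — the 1-in-8 attention ratio here is illustrative, not any particular published model’s:

```python
def hybrid_schedule(n_layers: int, attention_every: int = 8) -> list[str]:
    """Build a hybrid layer schedule: mostly SSM blocks, with one
    attention block every `attention_every` layers (assumed ratio)."""
    return [
        "attention" if (i + 1) % attention_every == 0 else "mamba"
        for i in range(n_layers)
    ]

sched = hybrid_schedule(32, attention_every=8)
print(sched.count("attention") / len(sched))  # 0.125 -> 12.5% attention layers
```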

The notable hybrid models (a Jamba-style block stack is sketched below):

graph TD
  subgraph Hybrid["Hybrid Mamba-Transformer Block Stack (Jamba-style)"]
    A1[Mamba block] --> A2[Mamba block]
    A2 --> A3[Mamba block]
    A3 --> A4[Mamba block]
    A4 --> ATN["Attention block ← content reasoning"]
    ATN --> B1[Mamba block]
    B1 --> B2[Mamba block]
    B2 --> B3[Mamba block]
    B3 --> BTN["Attention block"]
  end
  style ATN fill:var(--fig-accent-soft),stroke:var(--fig-accent)
  style BTN fill:var(--fig-accent-soft),stroke:var(--fig-accent)

Hybrid architecture: ~80–90% of blocks are Mamba (linear time, no KV cache), with sparse attention blocks every N layers to handle content-based retrieval that pure SSMs struggle with.

  • Jamba (AI21, 2024) — a hybrid Mamba-Transformer model with MoE. ~50B parameters, mix of attention and Mamba blocks. The first major commercial deployment of a Mamba-based architecture.
  • Zamba (Zyphra, 2024) — small hybrid models (1.2B, 7B) with shared attention layers across the SSM stack.
  • Samba (Microsoft, 2024) — combines Mamba with sliding window attention.
  • TTT models (test-time training, 2024) — a related lineage in which the recurrent state is itself a small model updated online at inference time.

Hybrid models capture most of the SSM efficiency benefits while preserving the content-reasoning quality of attention. As of late 2025, hybrid models are the most realistic path for SSMs to enter production at frontier scale.

41.8 Quality vs transformers

The honest assessment of SSMs vs transformers on language modeling quality, as of late 2025:

  • Pure SSMs (Mamba, Mamba-2) match transformers on perplexity at moderate scale. They lag slightly on tasks that require precise retrieval or content-based reasoning. The “needle in a haystack” benchmark, in particular, is hard for pure SSMs because the model can’t easily attend to a specific past token.

  • Hybrid models match or exceed transformers on most benchmarks. The few attention layers handle the retrieval-style tasks; the SSM layers handle the bulk processing.

  • At the largest scale, the picture is less clear. Most frontier models are still pure transformers, and the SSM line of work hasn’t been demonstrated at 70B+ parameters with the same training compute.

The frontier labs have all run experiments on SSM and hybrid architectures. None of them have publicly released a frontier-scale SSM model (yet). The reasons are operational and inertial more than technical — the transformer infrastructure is mature, the SSM infrastructure is not.

For the next generation, expect more hybrid models, especially in the long-context space where SSMs have a clear advantage.

41.9 The state of the art in 2025

Where SSMs sit as of late 2025:

Model            Year  Architecture                      Notes
S4               2021  Pure SSM                          The classical baseline. Not selective.
Mamba            2023  Selective SSM (S6)                The breakthrough. Custom CUDA kernels.
Mamba-2          2024  Selective SSM with SSD            Matmul-friendly. Fast on modern GPUs.
Jamba            2024  Hybrid Mamba+Attention+MoE        First major commercial hybrid model.
Zamba            2024  Hybrid SSM with shared attention  Small efficient models.
Samba            2024  Mamba + sliding window attention  Microsoft research.
Codestral Mamba  2024  Code-focused Mamba                Mistral’s first SSM-based model.
Falcon Mamba     2024  7B Mamba                          TII’s pure Mamba release.

The trend: SSMs are moving from pure research to product. Hybrid models are taking the lead. Pure SSMs work for specific use cases (long context, code, low-resource) but haven’t displaced transformers at the frontier.

The next 12-24 months will show whether SSMs become a fundamental part of the production stack or remain a research curiosity. The inference efficiency advantages are real; the quality competition is genuine; what’s lacking is operational maturity. As Mamba-2 kernels mature and hybrid architectures get more battle-tested, expect more production deployments.

41.10 The mental model

Eight points to take into Chapter 42:

  1. State-space models are linear-time sequence operators that compete with attention.
  2. Classical SSMs (S4) were input-independent and weak at content-based reasoning.
  3. Mamba introduced selective state spaces (S6) — input-dependent transitions — which closed the quality gap.
  4. Mamba-2 uses structured state-space duality to map SSM operations onto matmul-friendly kernels. Faster on modern GPUs.
  5. Inference advantage: SSMs have constant-size state per sequence (no KV cache!), so memory pressure and long-context cost drop dramatically.
  6. Quality: pure Mamba matches transformers on perplexity but lags on retrieval tasks. Hybrid models (some attention layers + many SSM layers) get the best of both.
  7. Notable hybrids: Jamba, Zamba, Samba, plus several research models from frontier labs.
  8. State of the art: SSMs are moving from research to product. Hybrid is the leading direction. Long context is the killer use case.

In Chapter 42 we look at the other major frontier shift in 2024-25: reasoning models and test-time compute.


Read it yourself

  • Gu et al., Efficiently Modeling Long Sequences with Structured State Spaces (S4, 2021).
  • Gu & Dao, Mamba: Linear-Time Sequence Modeling with Selective State Spaces (2023).
  • Dao & Gu, Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality (Mamba-2, 2024).
  • Lieber et al., Jamba: A Hybrid Transformer-Mamba Language Model (2024).
  • The Mamba GitHub repository and the Triton scan kernels.
  • Albert Gu’s blog posts on the design intuitions for SSMs.

Practice

  1. Why are linear recurrences parallelizable but nonlinear recurrences are not? Argue from the parallel scan algorithm.
  2. What’s the difference between input-independent and input-dependent SSMs? Why does the latter matter for language modeling?
  3. Compute the per-token state size for a Mamba 7B (typical config: D_state = 16, n_layers = 32, d_model = 4096) and compare to a Llama 7B’s KV cache per token.
  4. Why does Mamba-2 use “structured state-space duality” — what’s the practical advantage over the original Mamba?
  5. Read the Jamba paper. Identify the ratio of attention to Mamba layers. Why this specific ratio?
  6. If SSMs are linear in sequence length and transformers are quadratic, why do most frontier labs still use transformers? Argue both technical and inertial reasons.
  7. Stretch: Run a small Mamba model (e.g., state-spaces/mamba-2.8b) and a similar-sized Llama model on a long-context retrieval task. Compare quality and per-token latency.