Part III · Inference Internals & Production Serving
Chapter 43

Structured generation: guided decoding, JSON mode, regex constraints, FSM masking

"You can make an LLM produce valid JSON 100% of the time. The trick is not in the prompt; it's in the sampler."

We close out Stage 3 (research frontier) with the most production-relevant technique in Part III: structured generation. The setup: you have an LLM and you need it to produce output that conforms to a specific format — JSON matching a schema, code in a specific language, SQL with valid syntax, a regex-matching string. The naive approach is to ask politely in the prompt and hope. The structured approach is to constrain the sampler at every step so the model can only produce tokens that keep the output valid.

The result: you can guarantee 100% format compliance, often with better quality than prompt-based approaches, at very low overhead. Tools like Outlines, XGrammar, JSON mode in OpenAI’s API, and SGLang’s grammar-constrained decoding all use variants of this technique.

By the end of this chapter you will know:

  • Why prompt-based “respond in JSON” fails a few percent of the time even on strong models.
  • How FSM-based masking guarantees valid output.
  • The implementation of guided decoding via logit masking.
  • The library landscape: Outlines, XGrammar, JSON Schema-based libraries.
  • How to use it in production.
  • The trade-offs: speed cost, quality cost, and when constraint-driven generation actually changes model behavior.

Outline:

  1. The structured-output problem.
  2. Why prompting alone isn’t enough.
  3. The logit masking idea.
  4. Finite state machines for constraints.
  5. JSON schema as a grammar.
  6. Outlines, XGrammar, and the library landscape.
  7. The speed cost of guided decoding.
  8. The quality picture: does constraining help or hurt?
  9. Production patterns.

43.1 The structured-output problem

Many production LLM applications need machine-parseable output. Examples:

  • A chatbot that calls tools needs to emit JSON with the tool name and arguments.
  • A coding assistant needs to emit valid Python or SQL.
  • A data extraction pipeline needs to emit JSON matching a schema.
  • A function-calling API needs valid JSON arguments matching a declared schema.
  • A configuration generator needs valid YAML.
  • A natural-language-to-SQL system needs syntactically correct SQL.

In all these cases, the consumer of the model’s output is a parser, not a human. If the model emits “almost JSON” with a missing comma or an extra field, the parser rejects it and the request fails.

The naive approach is to prompt the model to produce the right format: “Respond in JSON with the following fields…” This works most of the time. Strong models follow JSON formatting instructions correctly maybe 95% of the time. The 5% failure rate is the problem — it’s not zero, and at scale it means thousands of failed requests per day.

The typical failure modes:

  • Trailing commas where JSON forbids them.
  • Single quotes instead of double quotes.
  • Markdown code fences around the JSON (```json ... ```).
  • Comments inside the JSON.
  • Missing required fields.
  • Extra commentary before or after the JSON.

Each is a small thing, but each breaks downstream parsing. The 95% success rate is not good enough for production. What you want is 100%.

43.2 Why prompting isn’t enough

You might think “just prompt better and you’ll get to 99%.” This works partially — clearer prompts, few-shot examples, and stronger models all push the success rate up. But there’s a fundamental ceiling: the model’s distribution over next tokens always assigns nonzero probability to invalid continuations. Even if the probability is 0.001, you’ll occasionally sample one of those invalid tokens and produce broken output.

The ceiling for prompt-based JSON compliance with frontier models in 2025 is roughly 99.5%. Better than the bad old days (where it was ~85%), but still not 100%.

For applications that need 100% — and many do — the answer is constrained generation: enforce the format at the sampler level so invalid tokens can’t be produced.

43.3 The logit masking idea

The basic mechanism of constrained generation:

  1. The model produces logits for the next token (Chapter 8).
  2. Before sampling, set the logits of all “invalid” tokens to -∞.
  3. Sample from the masked distribution. Only valid tokens have nonzero probability.
  4. Repeat for each step.

The result: the model literally cannot emit a token that would break the format. Every sampled token is one that keeps the partial output on a path to a valid complete output.
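
A minimal pure-Python sketch of the four steps, using a hypothetical four-token vocabulary (a real sampler operates on GPU tensors over a ~100k-token vocabulary and draws from the distribution rather than taking the argmax):

```python
import math

def masked_sample_argmax(logits, valid_token_ids):
    # Step 2: set the logits of all invalid tokens to -inf.
    masked = [l if i in valid_token_ids else float("-inf")
              for i, l in enumerate(logits)]
    # Step 3: softmax over the masked logits; exp(-inf) = 0, so invalid
    # tokens get exactly zero probability.
    m = max(masked)
    exps = [math.exp(l - m) for l in masked]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Greedy pick for the demo; a real sampler would draw from `probs`.
    return max(range(len(probs)), key=probs.__getitem__), probs

# Toy vocab: 0='{', 1='a', 2='[', 3='Hi'. At the start of a JSON document,
# only '{' and '[' are valid openers, so the mask keeps tokens 0 and 2.
logits = [1.0, 3.0, 0.5, 2.0]   # raw model logits; token 1 ('a') is preferred
token_id, probs = masked_sample_argmax(logits, valid_token_ids={0, 2})
# Token 1 had the highest raw logit, but it is masked out; token 0 wins.
```

Note that the model's preferences among the surviving tokens are untouched: masking only removes the invalid options and renormalizes.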

Logit masking: the model's raw logits over the vocabulary are set to −∞ for every token that would violate the current FSM state, so softmax assigns zero probability to invalid tokens and format violations become structurally impossible. The FSM provides the valid-token set for each state; the lookup is O(1) because the sets are precomputed at schema compile time.

The hard part is computing the mask. For each step, you need to know which tokens are valid given the current partial output. This is a parsing problem: you have a partial string, and you need to know which characters could come next without making the string ungrammatical.

For simple constraints (e.g., “respond with a single integer”), the mask is easy: only digit tokens are valid until EOS. For complex constraints (e.g., “respond with JSON matching this schema”), the mask requires real parsing logic.

The general approach is to compile the constraint into a finite state machine (FSM) and use the FSM to determine valid next tokens at each position.

43.4 FSMs for constraints

A finite state machine has a set of states, an initial state, and a transition function that determines the next state given the current state and an input character. For grammar constraints:

  • The states correspond to “positions in the partial parse tree.”
  • The transitions correspond to “what characters are allowed at each position.”
  • The accepting states correspond to “complete valid output.”

For a regex constraint (e.g., “match [A-Z][a-z]+”), the FSM has explicit states: start, after-uppercase, after-lowercase. At the start, only uppercase letters are valid. After the uppercase, only lowercase letters are valid; once at least one lowercase has been emitted, lowercase letters and EOS are both valid. The FSM is small.
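
A hand-built DFA for this example makes the per-state mask concrete (a sketch; the libraries derive the automaton from the regex automatically):

```python
import string

# Hand-built DFA for [A-Z][a-z]+ with the three states from the text:
# start, after-uppercase, after-lowercase (the last one is accepting).
START, AFTER_UPPER, AFTER_LOWER = 0, 1, 2

def step(state, ch):
    """Transition function: next state, or None if `ch` is invalid here."""
    if state == START and ch in string.ascii_uppercase:
        return AFTER_UPPER
    if state in (AFTER_UPPER, AFTER_LOWER) and ch in string.ascii_lowercase:
        return AFTER_LOWER
    return None

def valid_chars(state):
    """The per-state character mask: everything with a defined transition."""
    return {c for c in string.ascii_letters if step(state, c) is not None}

def accepts(s):
    """Run the whole string; valid iff we end in the accepting state."""
    state = START
    for ch in s:
        state = step(state, ch)
        if state is None:
            return False
    return state == AFTER_LOWER
```

At the start only the 26 uppercase letters are unmasked; EOS becomes legal only once the DFA reaches the accepting state.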

A minimal FSM for a JSON object with one required string key and string value: parser states run from S0 (start) through transitions for “{”, the opening and closing quotes of the key, “:”, and the value string, to an accepting state after “}”. Each circle is a parser state and each edge is the token (or token class) that triggers the transition. Each state defines which tokens are valid next; XGrammar precomputes the valid-token set per state, so the mask lookup is O(1) with no per-step vocabulary scan.

For a JSON schema constraint, the FSM is more complex. You have states for “expecting a key,” “expecting a colon,” “expecting a value of type X,” etc. The FSM is larger but still deterministic.

The transition isn’t quite character-by-character — it’s token-by-token, because the LLM emits tokens, not characters. To use the FSM with the LLM, you need to compute, for each token in the vocabulary, whether emitting it would lead to a valid state transition.

The naive approach: at every step, for every token in the vocabulary (say 100k tokens), simulate appending the token to the current output and check if the FSM accepts. This is 100k FSM evaluations per step — too slow.

The smart approach: precompute which tokens are valid for each FSM state. Once you know “in state X, the valid tokens are this set,” you can mask in O(1) per step (just look up the set for the current state). The precomputation is one-time and doesn’t add to per-step cost.

This is the trick that makes constrained generation fast.
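
A sketch of that precomputation, using the single-integer constraint from earlier as the character-level FSM and a toy five-token vocabulary (real vocabularies have ~100k entries, but the table is still built once per constraint, not once per step):

```python
# Character-level FSM for "a nonempty digit string" (the integer constraint
# from the text): a single state; digits loop back, anything else is invalid.
def step(state, ch):
    return 0 if ch.isdigit() else None

# Toy token vocabulary: LLM tokens are multi-character strings.
vocab = ["12", "3", "a", "4x", "07"]

def run_token(state, token):
    """Feed a whole token through the character-level FSM."""
    for ch in token:
        state = step(state, ch)
        if state is None:
            return None
    return state

# One-time precomputation: for each FSM state, the valid token ids and the
# state each one leads to. At decode time the mask is a single dict lookup.
mask_table = {0: {tid: run_token(0, tok)
                  for tid, tok in enumerate(vocab)
                  if run_token(0, tok) is not None}}
# mask_table[0] keeps "12", "3", and "07" and drops "a" and "4x".
```

The per-step work is now independent of vocabulary size: look up the current state, apply the stored mask.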

43.5 JSON schema as a grammar

JSON schemas define the structure of valid JSON: which keys are required, what types each value should be, what nesting is allowed. A JSON schema can be compiled into a grammar (specifically, a context-free grammar) that describes all valid JSON instances of that schema.

The compilation:

  • Each schema field becomes a grammar rule.
  • Required fields become required productions in the grammar.
  • Nested objects become nested rules.
  • Type constraints (string, integer, etc.) become character-class constraints.

The resulting grammar is then compiled into an FSM (or, more precisely, a pushdown automaton, since context-free grammars need a stack). The FSM is used for token masking during generation.
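
To make the compilation concrete: for a flat, non-recursive schema the grammar collapses to a regular expression (Outlines takes this shortcut for such schemas). A hand-written sketch for a one-required-field schema, simplified to ignore escape sequences inside strings:

```python
import re

# Schema: an object with one required string field "name".
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}},
    "required": ["name"],
}

# Hand compilation of that schema to a regex (simplified: no escape
# sequences, no extra fields, flexible whitespace). A real compiler
# handles all JSON types and nesting.
STRING = r'"[^"\\]*"'
pattern = re.compile(r'\{\s*"name"\s*:\s*' + STRING + r'\s*\}')
```

Any string the pattern fully matches is a valid instance of the schema; the regex is then turned into the FSM that drives token masking.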

The libraries that do this (Outlines, XGrammar, etc.) handle the compilation automatically. You provide a JSON schema; they give you a guided sampler that only emits tokens conforming to the schema.

The result: 100% schema compliance. The output is guaranteed to parse and to match the schema. No “respond in JSON” prompting needed.

43.6 The library landscape

The structured generation libraries you should know about:

```mermaid
graph TD
  A[Constraint source] --> B{Type?}
  B -->|JSON schema| C[Outlines / XGrammar]
  B -->|Regex| C
  B -->|Pydantic model| D[Instructor converts to schema]
  D --> C
  B -->|Custom CFG| E[Outlines CFG backend]
  C --> F{Serving stack?}
  F -->|vLLM v0.6+| G[XGrammar built-in]
  F -->|SGLang| G
  F -->|OpenAI API| H[structured_outputs mode]
  F -->|standalone| I[Outlines direct]
```

XGrammar is the default engine inside vLLM and SGLang; Outlines remains the standalone library for custom setups.

Outlines (Willard & Louf, 2023)

Outlines is the most prominent open-source library for structured generation. It supports:

  • JSON schema constraints (compile a schema, get a guided sampler).
  • Regex constraints (any regex pattern).
  • CFG constraints (custom context-free grammars).
  • Python type hints (Pydantic models compile to schemas).

Outlines integrates with Hugging Face Transformers and several serving stacks. It’s the “default” for many practitioners.

The implementation uses the FSM-based masking approach described above. The compilation step is fast enough to run at request time for typical schemas; for very large schemas, you can pre-compile and reuse.
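
Typical usage looks roughly like this, sketched against the Outlines 0.x API (outlines.generate.json, which compiles a Pydantic model or JSON schema into a guided sampler); the API has shifted across releases, so check the current documentation, and the model name here is only an example:

```python
import outlines
from pydantic import BaseModel

class ToolCall(BaseModel):
    tool_name: str
    argument: str

# Load any Hugging Face Transformers model; the name is just an example.
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")

# Compile the Pydantic model's JSON schema into an FSM-guided sampler.
generator = outlines.generate.json(model, ToolCall)

result = generator("Emit a tool call that searches the web for 'vLLM docs'.")
# `result` is a validated ToolCall instance; parsing can never fail.
```

The compile step happens once per schema; the generator can then be reused across requests.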

XGrammar (Dong et al., 2024)

XGrammar is a more recent library that focuses on performance. The key contribution: a more efficient mask computation that brings the overhead down to <1% of total inference time.

XGrammar achieves the speedup with:

  • Cache-friendly mask precomputation. The valid-token sets for each state are precomputed and stored compactly.
  • GPU-side masking. The mask is applied directly on the GPU instead of in Python, avoiding host-device transfers.
  • Adaptive masking. Some states have very few valid tokens (e.g., “expecting a colon” → only the colon token is valid). XGrammar special-cases these.

XGrammar is faster than Outlines and is used as the default in vLLM v0.6+ and SGLang. If you’re doing structured generation in production, you’re probably using XGrammar (possibly without knowing it).

JSON Mode (OpenAI / Anthropic)

The major API providers offer “JSON mode” or “structured output” features. OpenAI’s response_format={"type": "json_object"} enforces JSON validity at the sampler level (similar to Outlines/XGrammar), and its stricter Structured Outputs mode (response_format with a "json_schema" payload) enforces a full schema. Anthropic has a similar feature.

These are convenient but proprietary. For self-hosted serving, use vLLM/SGLang’s built-in support.

Other libraries

  • Guidance (Microsoft) — a templating + constrained-generation library. Older approach.
  • LMQL — a query language for LLMs with constrained generation as a primitive.
  • Instructor — a Pydantic-based wrapper for structured output.
  • JSONformer — a simple JSON-only constrainer.

The space is mature. For production, use XGrammar via vLLM/SGLang or the API providers’ built-in features.

43.7 The speed cost

Constrained generation has overhead. The mask computation, even when optimized, adds work per step. The early implementations had significant overhead (~10-30% slower than unconstrained generation). Modern implementations (XGrammar) have brought it down to <1%.

The cost depends on:

  • Schema complexity. Simple regex patterns are cheap; deeply nested JSON schemas are more expensive.
  • Mask sparsity. If most tokens are invalid (typical for tightly constrained schemas), the mask is essentially “drop everything except this small set” and is fast. If many tokens are valid, the mask is dense and slower.
  • Implementation quality. XGrammar is much faster than naive implementations because of the precomputation and GPU-side masking.

For typical JSON schema constraints with XGrammar in vLLM, the overhead is negligible (1-5%). The benefit (100% compliance) is much larger than the cost.

43.8 The quality picture

A subtle point: does constraining the sampler hurt model quality?

The naive answer is “no — you’re just removing invalid options.” But it’s more complex.

When you mask tokens, the probability mass that was on the masked tokens gets redistributed across the remaining tokens (after re-normalization). In some cases this is fine; in others it shifts the model’s behavior in ways you didn’t intend.

For example, suppose the model is generating a JSON value and the schema requires it to be a string. Without constraints, the model might emit either a string or a number with similar probabilities. With constraints, the model’s “wanted to emit a number” probability mass gets shifted to “emit a string.” The resulting string might be lower-quality than what the model would naturally choose.
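
The renormalization is easy to see numerically; a toy three-token example:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy next-token logits at a JSON value position. The model slightly
# prefers starting a number ('4') over a string ('"').
tokens = ['"', '4', '{']
raw = [1.0, 1.2, -2.0]
p_free = softmax(raw)            # unconstrained: '4' is most likely

# Schema requires a string here: mask everything except '"'.
masked = [1.0, float("-inf"), float("-inf")]
p_constrained = softmax(masked)  # all mass collapses onto '"'
```

The mass the model wanted to put on the number is silently reassigned to the string branch, and the continuation after the forced '"' may differ from what the model would have produced unconstrained.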

For most production cases, the quality difference is small. The benefit (no parse failures) is much larger than the cost.

But there are edge cases where structured generation can hurt:

  • When the schema is very tight, the model might be forced into output it didn’t really “want” to produce.
  • For creative tasks, constraints can make the output feel mechanical.
  • For complex schemas, the constrained output can be less coherent than free-form output that you parse loosely.

The general advice: use structured generation when format compliance matters more than the marginal quality cost. For most production data extraction, function calling, and tool use, this is the right trade-off. For creative writing or open-ended chat, leave the model free.

43.9 Production patterns

How structured generation is actually used in production:

(1) Function calling. OpenAI’s function calling API uses structured generation under the hood. You declare a function’s argument schema; the model emits valid JSON arguments. This is the dominant use case.

(2) Data extraction. Run a model over unstructured text with a JSON schema for the fields you want to extract. The output is guaranteed to be parseable.

(3) Tool use in agents. Agent loops (Chapter 67) use structured generation to ensure the model emits valid tool calls. Without it, agents are much less reliable.

(4) Code generation. Constrain output to be syntactically valid code in a specific language. Less common because most code generators just use parse-and-retry.

(5) Configuration generation. Generate YAML, TOML, or other config files with schema constraints.

(6) RAG with citations. Constrain the output to include valid citation markers (“[1]”, “[2]”, etc.) that map to retrieved documents.

The pattern: anywhere you need machine-parseable output with high reliability, structured generation is the right tool. The overhead is small; the reliability is much higher than prompt-based approaches.

In vLLM, you enable it via the guided_json or guided_regex parameters in the request. Pass a JSON schema or a regex; vLLM (with XGrammar) constrains the output. The same APIs exist in SGLang and TensorRT-LLM.
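
Against vLLM’s OpenAI-compatible server, for example, the schema rides along in the request body via the guided_json extension field; a sketch in which the endpoint and model name are placeholders for your deployment:

```python
import json

# JSON schema for the response we want.
schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["city", "population"],
}

# Request body for vLLM's OpenAI-compatible /v1/chat/completions endpoint.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",   # whatever you serve
    "messages": [
        {"role": "user", "content": "What is the largest city in Japan?"}
    ],
    "guided_json": schema,   # vLLM extension field; triggers grammar masking
    "max_tokens": 100,
}
body = json.dumps(payload)
# POST `body` to your server's /v1/chat/completions; the returned message
# content is guaranteed to parse and to match `schema`.
```

The analogous field guided_regex takes a regex pattern instead of a schema.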

43.10 The mental model — Stage 3 capstone

Eight points to take into Stage 4 of Part III:

  1. Structured generation guarantees machine-parseable output by masking the sampler.
  2. The mechanism is FSM/grammar compilation + per-step token masking.
  3. JSON schema as a grammar — schemas compile to FSMs that constrain output to valid instances.
  4. XGrammar is the modern fast implementation, used by vLLM and SGLang.
  5. The speed cost is minimal (<5%) with modern implementations.
  6. The quality cost is small but real for some workloads. The benefit usually outweighs it.
  7. Production uses include function calling, tool use in agents, data extraction, code generation, configuration.
  8. Use structured generation when format compliance matters. For free-form text, leave the model free.

This is the end of Stage 3 (research frontier) of Part III. You’ve now seen:

  • Stage 2 (practitioner internals): prefill/decode, KV cache, batching, PagedAttention, FlashAttention, quantization, speculative decoding, parallelism, prefix caching, cost, latency, multimodal.
  • Stage 3 (research frontier): attention compression (MHA/MQA/GQA/MLA), MoE, long context, disaggregated prefill/decode, KV cache offload, kernels, SSMs/Mamba, test-time compute, structured generation.

In Stage 4 (production), starting Chapter 44, we put it all together. The serving framework landscape, KServe, vLLM in production with every flag explained, AI gateways, autoscaling, the cold start problem, and the operational reality of running an inference platform at scale.


Read it yourself

  • Willard & Louf, Efficient Guided Generation for Large Language Models (2023). The Outlines paper.
  • Dong et al., XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models (2024).
  • The Outlines GitHub repository and documentation.
  • The XGrammar GitHub repository and the vLLM integration.
  • The OpenAI documentation on Structured Outputs and Function Calling.
  • The vLLM and SGLang documentation on guided generation.

Practice

  1. Write a regex for a North American phone number. Use Outlines or XGrammar to constrain a small model to emit valid phone numbers. Verify 100% of outputs match the regex.
  2. Why does prompt-based “respond in JSON” still fail a few percent of the time, even with careful prompting? Identify three specific failure modes.
  3. Explain how an FSM is built from a regex pattern. Walk through the construction for [A-Z][a-z]+@[a-z]+\.com.
  4. Why does XGrammar precompute valid-token sets per FSM state? What’s the alternative, and why is it slower?
  5. Read the Outlines paper. Identify the FSM-construction algorithm and its complexity.
  6. When would you NOT want to use structured generation? Give two examples.
  7. Stretch: Use vLLM’s guided JSON mode to generate Pydantic-validated objects from a small open model. Compare quality to prompt-based JSON output on the same prompts.