Part III · Inference Internals & Production Serving
Chapter 56 ~16 min read

Content safety as inference: guardrails architecture

"Safety is just another model. It's also another latency budget you have to fit into your SLO."

This is the closing chapter of Part III. We’ve covered everything from the prefill/decode asymmetry through paged attention, MoE serving, and reasoning models. The last topic is content safety — the layer of models and systems that prevents the LLM platform from producing or accepting harmful content. It’s positioned in this part because content safety is itself an inference workload, with all the same operational considerations as the LLMs it guards.

This chapter covers the architecture of content safety systems, the dual-provider pattern (rules + LLM), where guardrails sit in the request path, the latency budget, and the operational considerations.

Outline:

  1. The content safety problem.
  2. The two main approaches: rules-based and LLM-based.
  3. The dual-provider architecture.
  4. AWS Bedrock Guardrails.
  5. Llama Guard, Prompt Guard, ShieldGemma.
  6. Where guardrails sit in the request flow.
  7. Latency budgets.
  8. Output guardrails vs input guardrails.
  9. Operational considerations.

56.1 The content safety problem

Production LLM platforms have to enforce content policies. The reasons:

  • Legal: certain content (CSAM, illegal activities, doxing) has legal exposure.
  • Brand: operators don’t want their LLM producing offensive, hateful, or incoherent content.
  • Trust: users trust products that don’t surprise them with harmful outputs.
  • Business: B2B customers require their LLM provider to filter according to their policies.
  • Regulatory: jurisdictions are increasingly requiring content filtering for certain use cases.

The technical problem: detect harmful content in user inputs and model outputs, in real time, with low latency and high accuracy.

The scope of “harmful content” is broad:

  • Hate speech, harassment, threats.
  • Sexual content (especially involving minors).
  • Self-harm and suicide promotion.
  • Violence, weapons, attacks.
  • Illegal activities (drugs, fraud, trafficking).
  • PII and sensitive data leakage.
  • Disinformation.
  • Prompt injection attempts (where users try to break out of the system prompt).

Each category has its own detection approach. Each platform has its own policies. The combination is what content safety systems handle.

56.2 The two main approaches: rules-based and LLM-based

There are two fundamentally different approaches to content detection:

Rules-based detection

Pattern matching against known harmful content. Examples:

  • Keyword blocklists: lists of forbidden terms. Simple, fast, brittle.
  • Regex patterns: more flexible than keywords. Can match phone numbers, credit cards, etc.
  • Hash-based matching: known harmful images/text are hashed; new content is hashed and compared. Used for CSAM detection (e.g. Microsoft’s PhotoDNA, matched against NCMEC’s hash database).
  • Classifier models: classical ML classifiers (logistic regression, gradient boosting) trained on labeled examples.

Strengths:

  • Fast. Sub-millisecond per check.
  • Predictable. Same input → same output, always.
  • Auditable. Easy to explain why something was flagged.

Weaknesses:

  • Brittle. Easy to circumvent with paraphrasing, leetspeak, or just using synonyms.
  • High false positive rate. Pattern matching catches benign content that happens to match.
  • Doesn’t handle context. “I want to kill my plants” is fine; “I want to kill my neighbor” is not. Rules can’t tell the difference.
  • Maintenance burden. New harmful patterns appear constantly; the lists need updating.
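
To make the rules-based approach concrete, here is a minimal sketch of such a checker. The blocklist terms and regexes are placeholder examples, not real policy content; a production system would use curated lists and many more patterns:

```python
import re

# Illustrative rules-based checker. All terms and patterns below are
# placeholder examples. Returns "block", "pass", or "ambiguous" so that a
# later, smarter stage can handle the gray area.
BLOCKLIST = {"examplebadword1", "examplebadword2"}  # placeholder terms
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),    # US-SSN-shaped number
    re.compile(r"\b(?:\d[ -]?){13,16}\b"),   # credit-card-like digit run
]

def rules_check(text: str) -> str:
    lowered = text.lower()
    if any(term in lowered for term in BLOCKLIST):
        return "block"        # exact keyword hit: treat as clearly harmful
    if any(p.search(text) for p in PII_PATTERNS):
        return "ambiguous"    # PII-shaped content: escalate to a smarter check
    return "pass"             # nothing matched: clearly safe
```

Note the brittleness described above: "examplebadword1" is caught, but "examp1ebadword1" sails through. That gap is exactly what the LLM-based approach covers.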

LLM-based detection

Use a small LLM to classify content as harmful or benign. The model is trained on labeled examples and can understand context.

Examples:

  • Llama Guard (Meta) — a fine-tuned Llama model specifically trained for content moderation.
  • Llama Prompt Guard — focused on prompt injection detection.
  • OpenAI Moderation API — OpenAI’s hosted classifier.
  • Custom safety classifiers — fine-tuned from a base LLM for specific policies.

Strengths:

  • Context-aware. Can distinguish “kill my plants” from “kill my neighbor.”
  • Generalizes. Detects new variations of harmful content without explicit rules.
  • High accuracy with proper training data.

Weaknesses:

  • Slow. A small LLM forward pass is 10-100ms. Not free.
  • Less predictable. Same prompt can give different classifications across model versions.
  • Adversarial vulnerabilities. Prompt injection attacks against the safety model.
  • Compute cost. Every request now triggers an additional LLM forward pass.

The two approaches are complementary. Most production safety systems use both.

56.3 The dual-provider architecture

The standard pattern is dual-provider: run a fast rules-based check first, then a more expensive LLM-based check if the rules-based check is ambiguous.

[Request]
    |
    v
[Rules-based check]
    |
    +---> Clearly harmful → Block
    |
    +---> Clearly safe → Pass through
    |
    +---> Ambiguous → [LLM-based check]
                          |
                          +---> Harmful → Block
                          +---> Safe → Pass through

This minimizes the average latency: most requests are clearly safe and never hit the expensive LLM check. The LLM check is reserved for the gray area.

In practice, the rules-based check might be:

  • Hash-matching against known CSAM (always blocks).
  • Pattern matching for explicit threats.
  • Length checks (very long inputs are suspicious).
  • PII detection.

The LLM-based check is then for:

  • Subtle harmful content that pattern matching misses.
  • Context-sensitive judgments.
  • Novel attack patterns.

The combination gives you both speed (most requests pass quickly) and accuracy (the hard cases get the smart check).

A more elaborate version uses multiple LLM judges in parallel for the hard cases — one for hate speech, one for CSAM-related content, one for self-harm, etc. — and combines their verdicts. This is more accurate but more expensive.
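
The routing logic above can be sketched in a few lines. `rules_check` and `llm_check` stand in for the two providers; their names and signatures are illustrative:

```python
# Sketch of the dual-provider routing logic. `rules_check` returns
# "block" / "pass" / "ambiguous" (sub-millisecond); `llm_check` returns True
# when content is judged harmful (~50 ms, e.g. a Llama Guard call).
def safety_gate(text: str, rules_check, llm_check) -> bool:
    """Return True if the request may proceed to the main LLM."""
    verdict = rules_check(text)
    if verdict == "block":
        return False             # clearly harmful: reject without LLM cost
    if verdict == "pass":
        return True              # clearly safe: skip the expensive check
    return not llm_check(text)   # gray area: pay for the LLM judge
```

Because most traffic takes the first two branches, the average added latency stays close to the rules-based cost.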

Figure: the dual-provider safety architecture. Most requests pass the rules-based check in under 5 ms; only the ambiguous gray area pays the ~50 ms LLM check (e.g. Llama Guard) before reaching the main LLM, keeping average safety latency low.

56.4 AWS Bedrock Guardrails

AWS Bedrock Guardrails is one of the most widely used managed content safety services. It’s part of AWS’s Bedrock LLM platform and provides a configurable filtering layer for both input and output.

Bedrock Guardrails offers:

  • Topic filters: deny lists of topics (“legal advice,” “medical advice,” “explicit sexual content”).
  • Content filters: classifiers for hate, insults, sexual content, violence, misconduct.
  • Word filters: explicit blocklists.
  • PII filters: detect and redact PII (SSNs, credit cards, phone numbers, etc.).
  • Contextual grounding checks: verify that LLM outputs are grounded in the source material.

The configuration is declarative — you describe the policy in JSON and AWS enforces it. Bedrock Guardrails applies to any model running on Bedrock, including third-party models (Claude, Llama, Mistral) hosted there.

Many enterprise teams use Bedrock Guardrails as their default safety layer because it’s pre-integrated with their AWS stack and has good coverage out of the box.
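
To illustrate the declarative style, here is a policy sketch as a Python dict. The field names loosely mirror the Bedrock Guardrails API (contentPolicyConfig, filter strengths, PII actions) but are not guaranteed to match the current schema — treat this as the shape of the idea, and consult the AWS documentation for the real fields:

```python
# Illustrative guardrail policy in the declarative spirit of Bedrock
# Guardrails. Field names approximate the AWS API and may not match the
# current schema exactly.
guardrail_policy = {
    "name": "support-bot-guardrail",
    "contentPolicyConfig": {
        "filtersConfig": [
            # classifier-backed filters with per-direction strength
            {"type": "HATE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
            {"type": "VIOLENCE", "inputStrength": "MEDIUM", "outputStrength": "HIGH"},
        ]
    },
    "wordPolicyConfig": {
        "wordsConfig": [{"text": "example-forbidden-term"}]  # explicit blocklist
    },
    "sensitiveInformationPolicyConfig": {
        "piiEntitiesConfig": [
            {"type": "US_SOCIAL_SECURITY_NUMBER", "action": "ANONYMIZE"},
            {"type": "CREDIT_DEBIT_CARD_NUMBER", "action": "BLOCK"},
        ]
    },
}
```

The point of the declarative form is that the same policy document drives both input and output filtering, and can be versioned and audited independently of the models it guards.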

56.5 Llama Guard, Prompt Guard, ShieldGemma

Meta has released a family of open models for content safety:

Llama Guard

Llama Guard (Meta, 2023) is a fine-tuned Llama model (originally Llama 2 7B, with later variants) trained for content moderation. It takes a (prompt, response) pair as input and generates a short plaintext verdict: either safe, or unsafe followed by the violated category codes, e.g.:

unsafe
S1

(S1 maps to "Violent Crimes" in the taxonomy below.)

Llama Guard is dual-purpose: it can check user prompts (for prompt safety) and model responses (for output safety). It’s typically run as a separate inference service alongside the main LLM.

The categories are aligned with the MLCommons safety taxonomy: violent crimes, non-violent crimes, sex-related crimes, child sexual exploitation, defamation, specialized advice, privacy, intellectual property, indiscriminate weapons, hate, suicide & self-harm, sexual content, elections, code interpreter abuse.

Llama Guard 3 (the current version) is an 8B model, with a smaller 1B variant for latency-sensitive deployments. Inference takes ~50ms per check on a single H100. Combined with the main LLM, it adds 5-10% latency overhead.
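
As a sketch, here is a tiny parser for the verdict text a Llama Guard-style judge returns — commonly the word safe, or unsafe followed by category codes such as S1; exact formats vary by version. The serving call itself (e.g. to a vLLM OpenAI-compatible endpoint) is omitted:

```python
# Minimal parser for a Llama Guard-style completion. Assumes the common
# plaintext format: "safe", or "unsafe" with category codes on following
# lines. Real deployments should pin the model version and its exact format.
def parse_guard_verdict(completion: str) -> dict:
    lines = [ln.strip() for ln in completion.strip().splitlines() if ln.strip()]
    if not lines or lines[0].lower() == "safe":
        return {"safe": True, "categories": []}
    return {"safe": False, "categories": lines[1:]}  # e.g. ["S1"]
```

A wrapper service would call this on the judge's completion and map any unsafe verdict to a block-or-replace decision.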

Llama Prompt Guard

A separate, smaller model focused specifically on prompt injection detection. Prompt injection is when a user tries to manipulate the LLM by including instructions in their input (“ignore the above and do this instead”). Prompt Guard is a small classifier (under 100M parameters) that detects this pattern.

Run before the main LLM. If Prompt Guard flags an input, you can refuse, sanitize, or pass through with extra caution.

ShieldGemma (Google) and other open guards

Google’s ShieldGemma is a similar offering: a fine-tuned Gemma model for content moderation. The space is competitive; expect more open safety models as the field matures.

56.6 Where guardrails sit in the request flow

The architectural question: where do you put the safety checks?

Option A: Before the LLM (input guardrails)

[Request] → [Input Safety Check] → [LLM] → [Response]

The check runs first. If it flags the input, you reject the request without ever calling the LLM, avoiding spending LLM compute on unsafe requests.

The downside: input checks can’t catch model misbehavior (the model still might respond unsafely to a safe-looking prompt).

Option B: After the LLM (output guardrails)

[Request] → [LLM] → [Output Safety Check] → [Response]

The check runs on the model’s output. If it flags the output, you replace it with a safe default (“I can’t help with that”). Catches model misbehavior even on benign-looking inputs.

The downside: you’ve already paid the LLM compute cost. And if you’re streaming, you have to either wait for the whole response (losing streaming UX) or check incrementally (more complex).

Option C: Both (recommended for production)

[Request] → [Input Check] → [LLM] → [Output Check] → [Response]

Best coverage. Input check rejects clearly unsafe prompts cheaply; output check catches anything the input check missed plus model misbehavior.

The cost is doubled safety latency (input + output checks).

For most production deployments, dual checks are the right choice. The latency cost is acceptable; the safety improvement is real.
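
Option C can be sketched as a thin wrapper around the main LLM call. The three callables are placeholders for real services (e.g. Prompt Guard, a vLLM endpoint, Llama Guard):

```python
# Sketch of Option C: input check -> LLM -> output check. Each callable is a
# placeholder for a real service; checks return True when content is safe.
SAFE_DEFAULT = "I can't help with that."

def guarded_generate(prompt: str, input_check, generate, output_check) -> str:
    if not input_check(prompt):            # reject before paying LLM compute
        return SAFE_DEFAULT
    response = generate(prompt)            # main LLM call
    if not output_check(prompt, response): # catch model misbehavior
        return SAFE_DEFAULT
    return response
```

Note that the output check sees both prompt and response — context-sensitive judges generally need the pair, not the response alone.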

Figure: input and output guardrails around the main LLM. The input guard blocks harmful prompts before generation, saving LLM compute; the output guard replaces unsafe responses when the model misbehaves on a benign-looking input. Both are needed in production.

For streaming responses, the output check has two flavors:

  • Blocking: wait for the full response, check it, return either the response or a safe default. Loses streaming.
  • Incremental: check chunks as they’re generated. Stream the response unless a check fails partway through. More complex but preserves streaming UX.

Most production systems use blocking for now because incremental is harder to get right.
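
A minimal sketch of the incremental flavor shows why it is harder: the check runs periodically on the accumulated text, and chunks already sent to the client cannot be retracted. `check` is a placeholder returning True when the text so far is safe:

```python
# Sketch of incremental output checking for a streaming response. Runs the
# safety check every `window` chunks over the accumulated text; on a flag,
# stops the stream and emits a safe default. Chunks already yielded cannot
# be retracted -- the core limitation of incremental checking.
SAFE_DEFAULT = "I can't help with that."

def guarded_stream(chunks, check, window: int = 16):
    buffered, since_check = [], 0
    for chunk in chunks:
        buffered.append(chunk)
        since_check += 1
        if since_check >= window:
            if not check("".join(buffered)):  # flagged mid-stream: abort
                yield SAFE_DEFAULT
                return
            since_check = 0
        yield chunk
    if not check("".join(buffered)):          # final check on the full text
        yield SAFE_DEFAULT
```

The window size is a latency/safety trade-off: a small window catches unsafe content sooner but multiplies the number of check calls per response.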

56.7 Latency budgets

Safety checks add latency. The budget breakdown for a typical chat request:

Step                                             Latency
Input safety check (rules-based)                 5 ms
Input safety check (LLM-based, if triggered)     50 ms
Main LLM TTFT (1,000-token prompt)               300 ms
Main LLM decode (200-token response, streaming)  10,000 ms
Output safety check (LLM-based)                  50 ms
Total user-perceived                             ~10,400 ms

The safety checks add ~100ms total in the typical case (input rules + output LLM). For a 10-second response, that’s 1% overhead. Acceptable.

For very fast responses (a 50-token response in 1 second), 100ms of safety overhead is 10% — more noticeable but still acceptable.

For streaming responses, the input check happens before the first token (adds 5-50ms to TTFT). The output check is at the end (adds 50ms to e2e but doesn’t affect TTFT).

The latency budget is the main reason rules-based checks are cheap and run first. They’re nearly free; most requests pass through them quickly. The LLM check is only run when needed.
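
The budget arithmetic from the table above is worth doing explicitly (assuming 50 ms per output token, as in the table):

```python
# Reproduce the latency-budget arithmetic from the table above.
rows_ms = {
    "input_rules": 5,
    "input_llm": 50,      # only paid when the rules check is ambiguous
    "ttft": 300,          # prefill of a 1,000-token prompt
    "decode": 200 * 50,   # 200 tokens x 50 ms/token = 10,000 ms
    "output_llm": 50,
}
total_ms = sum(rows_ms.values())  # ~10,400 ms end to end
safety_ms = rows_ms["input_rules"] + rows_ms["input_llm"] + rows_ms["output_llm"]
overhead = safety_ms / total_ms   # safety's share of total latency, ~1%
```

Even with the conditional LLM input check included, safety stays around 1% of the end-to-end latency for a response of this length.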

Figure: latency budget for a 200-token streaming response (300 ms prefill, 10,000 ms decode at 50 ms/token). Safety checks add roughly 100 ms of the ~10,400 ms total, about 1% overhead, making dual input+output guardrails operationally affordable.

56.8 Output guardrails vs input guardrails

A subtle point worth being explicit about: input guardrails and output guardrails catch different things.

Input guardrails catch:

  • Users asking for harmful content directly.
  • Users attempting prompt injection.
  • PII in user input (which you might want to redact before logging).
  • Inputs that violate topic policies.

Output guardrails catch:

  • Model hallucinations of harmful content (the model invents something bad even on a safe input).
  • Model leaking PII from its training data.
  • Model failing to refuse a borderline request.
  • Model outputting copyrighted content.

The two are complementary. A user who explicitly asks for instructions to make a weapon should be caught by the input guardrail. A user who asks an innocent-seeming question that the model unexpectedly answers with bomb-making instructions should be caught by the output guardrail. Production systems need both.

56.9 Operational considerations

A few production-relevant points:

(1) Safety models are an inference workload. They have all the same operational concerns as the main LLM: deployment, autoscaling, monitoring, latency. Run them on TEI or vLLM (depending on whether they’re encoder or decoder models) with proper autoscaling.

(2) Per-tenant policies. Different customers may have different content policies. The safety layer needs to support per-tenant configuration.

(3) Logging and audit. Safety decisions should be logged for audit. When a request is blocked, why? What was the input? What did the safety check return? This is important for debugging and for legal compliance.

(4) Human-in-the-loop. For high-stakes decisions, route some safety calls to a human reviewer. This is the appeals path. Less common in real-time but important for accuracy.

(5) False positive rate. Safety checks make mistakes. The cost of a false positive (blocking a legitimate request) is non-trivial. Monitor and tune to balance false positives against false negatives.

(6) Adversarial robustness. Users try to circumvent safety checks. Monitor for new attack patterns and update the rules / retrain the models.

(7) Versioning. Safety models change. New versions may have different verdicts on the same content. Version your safety layer separately from the main LLM.

(8) Latency monitoring. Safety checks should be monitored independently. If the safety LLM is slow, it adds latency to every request. Alert on safety latency separately from main LLM latency.

56.10 The mental model

Eight points to take into Part IV (information retrieval and RAG):

  1. Content safety is its own inference workload with the same operational concerns as the main LLM.
  2. Rules-based and LLM-based are the two approaches. Used together.
  3. Dual-provider architecture: cheap rules-based first, expensive LLM-based for the gray area.
  4. AWS Bedrock Guardrails, Llama Guard, ShieldGemma are the leading options.
  5. Input and output guardrails catch different things. Production needs both.
  6. Latency budget: ~100ms total safety overhead is typical and acceptable.
  7. Stream-aware output checks are harder; most systems do blocking checks for now.
  8. Operational concerns: per-tenant policies, audit logging, false positive monitoring, adversarial robustness.

This closes Part III (Inference Internals & Production Serving). You now have the full picture of how to serve LLMs in production: the runtime layer, the orchestration, the gateway, the autoscaler, the cache, the observability, the safety. In Part IV we shift to the next major topic: information retrieval and RAG.


Read it yourself

  • The Llama Guard paper (Meta, 2023).
  • The Llama Prompt Guard documentation.
  • The AWS Bedrock Guardrails documentation.
  • The Google ShieldGemma documentation.
  • The MLCommons AI Safety taxonomy.
  • The OpenAI Moderation API documentation.

Practice

  1. Why are rules-based and LLM-based checks complementary? Construct a scenario where each catches what the other misses.
  2. Design a content safety architecture for a customer support chatbot. What input checks? What output checks? What latency budget?
  3. Run Llama Guard on a few test inputs. Compare its verdicts to your intuitions.
  4. For a streaming response, why is incremental output checking harder than blocking? Identify the technical challenges.
  5. Why do safety checks need per-tenant policies? Construct a scenario.
  6. What’s the cost of a false positive vs a false negative in content safety? How does the trade-off depend on the use case?
  7. Stretch: Set up Llama Guard as a separate vLLM service. Build a wrapper service that runs input and output safety checks around a main LLM. Measure the latency overhead.