Design a content moderation pipeline
"A good moderation system isn't the one that catches the most bad content. It's the one that justifies its decisions to a judge."
Moderation is the most politically-loaded ML systems design problem. The answer is never just “classifier with a threshold.” It is a cascade of rules, classifiers, and human judgment, with different latency budgets at each stage, different cost functions for different content classes, and an appeals path that turns the system into one with due process. Candidates who answer this well demonstrate that they understand policy as engineering — that “false positive cost” is not symmetric with “false negative cost” and that the asymmetry changes the architecture.
This chapter works the canonical version: a user-generated-content moderation pipeline for a large social platform, handling text, images, and video, with multiple content classes ranging from spam (low-stakes) to CSAM (catastrophic-stakes).
Outline:
- Clarify: the asymmetric cost model.
- Estimate: scale and the cascade budget.
- High-level architecture — the four-stage cascade.
- Drill 1: the classifier layer.
- Drill 2: LLM-as-judge and human review.
- Latency budgets, queueing, and the sync/async split.
- Appeals, audit trails, and the due-process layer.
- Metrics, SLOs, and the two-dimensional confusion matrix.
- Failure modes and rollout.
- Tradeoffs volunteered.
- The mental model.
118.1 Clarify: the asymmetric cost model
The candidate opens with a different set of clarifying questions than the chatbot or RAG problems, because moderation has dimensions the others don’t.
1. What content modes — text, images, video, audio, all? Assume text + images + video. Audio out of scope for v1.
2. What content classes, and what are the relative stakes? The interviewer lists: CSAM, violent extremism, hate speech, harassment, spam, nudity/sexual content, self-harm, misinformation. The candidate groups them:
- Catastrophic (CSAM, violent extremism): false negatives are unacceptable; false positives are tolerable (temporarily block, human review fast).
- Severe (self-harm, targeted harassment): false negatives are very bad; false positives are bad but recoverable.
- Moderate (hate speech, nudity outside policy): bidirectional harm; appeals path matters.
- Low stakes (spam, mild policy violations): false positives dominate the user experience; throughput matters more than precision.
3. Sync or async — does the user post the content immediately and we review later, or do we block the post until review? Most platforms use a hybrid: sync for catastrophic class (block at post), async for moderate and low stakes (post now, review in background, take down if needed).
4. What’s the scale? 100M posts/day, peak 3000 posts/second. Images and video add another 20M uploads/day. Total content items ~120M/day.
5. Latency target for sync stage? User-facing post action must complete within 1 second. So the sync moderation budget is <300 ms after gateway and upload.
6. Who reviews flagged content? In-house reviewers + contracted third-party. Multi-language, multi-region, 24/7.
The key note the candidate writes down: CSAM has a zero-false-negative policy, which makes the architecture fundamentally different from “normal” classification problems.
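The class grouping above can be made concrete as a configuration the rest of the pipeline reads. This is a minimal sketch under assumed names and values — the classes, tiers, and tolerances are illustrative, not any real platform's policy:

```python
# Hypothetical severity-tier config: each content class carries the policy
# knobs the pipeline consults (sync vs async, false-negative tolerance).
from dataclasses import dataclass

@dataclass(frozen=True)
class ClassPolicy:
    tier: str            # catastrophic | severe | moderate | low
    sync_block: bool     # block the post until review completes?
    fn_tolerance: float  # acceptable false-negative rate for this class
    appeal_allowed: bool

POLICY = {
    "csam":              ClassPolicy("catastrophic", True,  0.0,  False),
    "violent_extremism": ClassPolicy("catastrophic", True,  0.0,  True),
    "self_harm":         ClassPolicy("severe",       False, 0.01, True),
    "harassment":        ClassPolicy("severe",       False, 0.01, True),
    "hate_speech":       ClassPolicy("moderate",     False, 0.05, True),
    "spam":              ClassPolicy("low",          False, 0.10, True),
}

def requires_sync_review(content_classes: list[str]) -> bool:
    """A post is held at post-time iff any candidate class is sync-blocking."""
    return any(POLICY[c].sync_block for c in content_classes)
```

Encoding the asymmetry as data rather than scattered if-statements is what lets thresholds, routing, and fallback behavior all stay consistent with a single policy source.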
118.2 Estimate: scale and the cascade budget
The chain:
- Posts per day: 120M items.
- Average rate: 120M / 86,400 ≈ 1400 items/sec.
- Peak rate: ~3000 items/sec.
- Per-item compute budget (sync stage, ~300 ms, CPU+GPU combined): at 3000/sec, total fleet capacity must be >900 core-seconds/sec, i.e., at least 900 CPU cores plus GPU for classifiers.
Cascade funnel. The whole point of a cascade is that each stage filters traffic, so downstream stages only see a fraction. Typical funnel for a moderation pipeline:
| Stage | Input rate | Output rate | Budget | Cost |
|---|---|---|---|---|
| Rules / hashing | 3000/s | 3000/s | <5 ms | cheap |
| Small classifier | 3000/s | ~200/s (6% flag) | <50 ms | medium |
| LLM judge | 200/s | ~40/s (20% flag) | <500 ms | expensive |
| Human review | 40/s (0.003% of all) | final decisions | minutes | most expensive |
The cascade takes 3000/sec of input and funnels it to 40/sec of human-review volume — a 75× reduction end-to-end. The point: each stage has to be one to two orders of magnitude cheaper per item than the next, otherwise the cascade is pointless.
graph LR
IN["3000 items/sec\n(peak)"] --> R1["Stage 1: Rules + Hashing\n<5 ms · cheap CPU\nCSAM hash: REJECT now"]
R1 -->|"~3000/sec pass"| R2["Stage 2: Small Classifier\n<50 ms · GPU\nDistilBERT / ViT"]
R2 -->|"~200/sec flagged"| R3["Stage 3: LLM Judge\n<500 ms · async\nLlama-Guard class"]
R3 -->|"~40/sec human queue"| R4["Stage 4: Human Review\nminutes–hours SLA\n~1800 reviewer-equivalents"]
R2 -->|"~2800/sec benign"| ALLOW["ALLOW\n(sync, <300 ms)"]
R3 -->|"~160/sec benign"| ALLOW2["ALLOW\n(async, 1–5 min)"]
R4 --> FINAL["Enforce / Allow / Appeal"]
style R1 fill:var(--fig-accent-soft),stroke:var(--fig-accent)
Each cascade stage is ~100× cheaper per item than the next — if any stage's per-item cost approaches the next stage's, the cascade loses its economic justification.
- Human reviewers needed: 40/sec × 30 seconds per review × 1.5 for breaks = ~1800 reviewer-equivalents during peak. Distributed globally, shifted, with specialization by content class.
GPU count for classifiers. A small classifier (DistilBERT-class, 60M parameters) does ~5000 items/sec per GPU. At 3000/sec peak, one GPU suffices for throughput, but latency and redundancy push to 4+ replicas. For images, a ViT-class model does ~500/sec per GPU; at 300 image-posts/sec, that’s ~1 GPU, round to 4. For video, a frame-level model + motion features runs at ~5 videos/sec on a single GPU; at 300 videos/sec peak, that’s ~60 GPUs for video alone. Video is the expensive class.
LLM judge. A small LLM (8B class) at ~5000 tokens/sec on one H100. Each judge call is ~500 tokens. At 200/sec judge rate, that’s 100k tokens/sec, requiring ~20 H100s worth of capacity — but judge calls are async-tolerant, so the fleet can be much smaller and queue.
Cost: video classifiers dominate compute. ~60 GPUs × $2/hour × 720 hours ≈ $86k/month for video. LLM judge ~$15k/month. Classifiers ~$10k/month. Human review dwarfs all of it: at an average ~17/sec reaching human review (the 40/sec figure is peak), 30 seconds per review, and 1.5× overhead, that is ~18,000 reviewer-hours/day, roughly $8M/month at $15/hour. Moderation is human-labor bound, not compute-bound — an important realization.
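The funnel and staffing arithmetic above can be checked in a few lines. This reproduces the chapter's estimates; the 1/15 flag rate is just the ratio that yields the ~200/sec classifier output:

```python
# Back-of-envelope cascade sizing, using the chapter's estimated rates.
PEAK_RATE = 3000                  # items/sec at peak

judge_rate = PEAK_RATE / 15       # classifier flags ~6.7% -> ~200/sec to judge
human_rate = judge_rate * 0.20    # LLM judge escalates ~20% -> ~40/sec to humans

# Little's law flavor: concurrent reviewers = arrival rate x service time,
# with a 1.5x overhead factor for breaks and QA sampling.
reviewers = human_rate * 30 * 1.5

video_gpus = 300 / 5              # 300 videos/sec peak at ~5 videos/sec per GPU
```

Running the arithmetic gives ~200/sec judge load, ~40/sec human load, ~1800 reviewer-equivalents, and ~60 GPUs for video, matching the figures in the text.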
118.3 High-level architecture — the four-stage cascade
[ user upload / post ]
|
v
[ upload gateway + CDN ]
|
v
[ blob store (S3) ] -- for media; text inline
|
v
[ moderation router ]
|
v
============ SYNC STAGE (budget ~300 ms) =============
[ stage 1: rules + hash match ]
- regex blocklists
- PhotoDNA / CSAM hash DB (NCMEC)
- known-bad-URL DB
- ModelDB (offender signatures)
- if hard hit (CSAM hash): REJECT + report to NCMEC
- if soft hit: escalate to stage 2
|
v
[ stage 2: small classifier (per modality) ]
- text: DistilBERT or small transformer per class
- image: ViT-Safety or CLIP-based classifier
- video: keyframe + audio model, async preflight
- if high confidence benign: ALLOW
- if high confidence violation: ASYNC to human; notify user
- if uncertain: escalate to stage 3
============ END SYNC (return to user) ===============
|
v (async)
============ ASYNC STAGE (budget minutes) =============
[ stage 3: LLM judge ]
- vision-language model for images
- text LLM with policy prompt
- outputs structured verdict + reasoning
- confident benign: ALLOW
- confident violation: QUEUE to human
- ambiguous: QUEUE to human (harder case queue)
|
v
[ stage 4: human review ]
- routed by class, language, severity
- SLA: 5 min for catastrophic, 2 hour for moderate, 24h for low
- decision: allow / remove / warn / ban / escalate
|
v
[ enforcement action ]
- delete, hide, warn, ban user, report
- emit audit event
|
v
[ appeals path ]
- user contests decision
- re-review by different reviewer or senior
- update model training data
Technologies labeled:
- Moderation router: custom service; routes based on content type and user risk score.
- Rules engine: a mix of regex library (re2), PhotoDNA for CSAM, internal hash DBs.
- Classifiers: TEI or Triton inference server on GPU pods; per-class models.
- LLM judge: vLLM fleet with a Llama-Guard-like model and a vision model for image policy.
- Human review tooling: a custom review UI with built-in policy guidance, decision logging, and reviewer health metrics (fatigue, time-per-review).
- Audit log: Kafka → object storage; immutable, queryable by user.
- Appeals: separate service, separate queue with higher-skill reviewers.
The interviewer says “drill into the classifier layer.”
118.4 Drill 1: the classifier layer
The job. Given a piece of content, return a multi-label probability vector over policy classes within 50 ms p95, with enough precision that high-confidence decisions don’t need further stages.
Architecture. One classifier head per content class per modality, sharing a backbone where possible. For text: a DistilBERT backbone with one linear head per class, joint-trained on a multi-task dataset. For images: a CLIP backbone frozen, with a small MLP head per class. For video: keyframe sampling + the image classifier, plus an audio classifier on the audio track.
Calibration and thresholds. Raw classifier probabilities are not calibrated. The first step in production is temperature scaling or Platt scaling on a held-out calibration set, so that a score of 0.9 actually means “90% probability.” Miscalibrated scores destroy the cascade because the “confident benign” and “confident violation” thresholds become unreliable.
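A minimal sketch of temperature scaling for the binary case, assuming held-out logits and labels from the classifier. A grid search stands in for the usual gradient-based fit; a real system would use torch or scipy:

```python
# Temperature scaling sketch: find the temperature T that minimizes held-out
# negative log-likelihood, so sigmoid(logit / T) is an honest probability.
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def nll(logits, labels, temp: float) -> float:
    """Mean negative log-likelihood under temperature-scaled scores."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = min(max(sigmoid(z / temp), 1e-12), 1 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(logits)

def fit_temperature(logits, labels) -> float:
    """Grid-search the NLL-minimizing temperature on a calibration set."""
    grid = [t / 10 for t in range(1, 51)]  # 0.1 .. 5.0
    return min(grid, key=lambda t: nll(logits, labels, t))
```

An overconfident model (large logits, mediocre accuracy) yields a fitted temperature well above 1, which is exactly the correction the cascade thresholds depend on.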
Per-class thresholds are set differently for different severity classes — auto-allow below the benign threshold, auto-flag above the violation threshold, escalate in between:
| Class | Benign threshold (auto-allow below) | Violation threshold (auto-flag above) |
|---|---|---|
| CSAM | (hash match is primary) | 0.3 — err on escalation |
| Violent extremism | 0.1 | 0.4 |
| Hate speech | 0.2 | 0.7 |
| Harassment | 0.2 | 0.8 |
| Spam | 0.6 | 0.9 |
Note the asymmetry. For CSAM, we’d rather escalate a borderline case to humans than miss it. For spam, we’d rather tolerate some false negatives than annoy users with false positives on their jokes. The thresholds encode the policy.
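The routing logic that consumes these thresholds is a one-screen function. A sketch, assuming calibrated scores; the threshold values here are illustrative:

```python
# Per-class three-way routing: auto-allow, auto-flag, or escalate to the
# LLM judge. Threshold values are illustrative policy choices.
THRESHOLDS = {
    # class: (benign_below, violation_above)
    "violent_extremism": (0.1, 0.4),
    "hate_speech":       (0.2, 0.7),
    "harassment":        (0.2, 0.8),
    "spam":              (0.6, 0.9),
}

def route(content_class: str, score: float) -> str:
    benign, violation = THRESHOLDS[content_class]
    if score < benign:
        return "ALLOW"
    if score >= violation:
        return "FLAG"      # high-confidence violation: queue for enforcement
    return "ESCALATE"      # uncertainty band: send to the LLM judge
```

The width of the uncertainty band is what controls downstream volume: widening it buys recall at the cost of judge and reviewer load.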
Calibration drift. Real-world content shifts over time. A classifier trained on 2024 data underperforms on 2025 slang, memes, and visual trends. Mitigation: continuous retraining on flagged content with human labels, weekly model refresh with A/B eval on a held-out set.
Multi-model ensemble for high-stakes classes. For CSAM and violent extremism, ensemble multiple models (one primary, one fallback) and require both to agree before a “confident benign” decision. Any disagreement escalates. This doubles compute for those classes but is justified by the zero-false-negative requirement.
Adversarial robustness. Adversaries deliberately evade classifiers — rotate images, add noise, paraphrase text. The arms race is real. Mitigations: train with adversarial augmentations (CLIP-based text attacks, visual perturbations), ensemble with diverse backbones, run semantic fingerprinting (perceptual hashes) in parallel with classification.
Latency and throughput math per modality:
| Modality | Model | GPU | Throughput | Latency p50 | Latency p95 |
|---|---|---|---|---|---|
| Text | DistilBERT + heads | 1× A10 | ~8000/s | 5 ms | 15 ms |
| Image | CLIP-base + MLP | 1× A10 | ~500/s | 20 ms | 50 ms |
| Video | Keyframe × 10 + audio | 1× A10 | ~5/s | 300 ms | 1 s |
Video is the outlier. With a sync budget of 300 ms, video cannot wait for full classification. Instead: the sync stage returns “provisionally allowed” based on a fast preflight (first keyframe + filename + user risk score), and full classification runs async in <1 minute. If the full classification then fires a violation, the video is pulled down async. This is the hybrid sync-for-fast/async-for-full pattern that all major platforms use for video.
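A sketch of the two halves of that hybrid pattern, under assumed signal names and cutoffs:

```python
# Hypothetical video hybrid: a cheap sync preflight grants provisional allow,
# and the full async classification can later revoke it. Signals and cutoffs
# are illustrative.
def preflight(first_keyframe_score: float, user_risk: float) -> str:
    """Sync path: cheap signals only, must fit the 300 ms budget."""
    if first_keyframe_score > 0.8 or user_risk > 0.9:
        return "hold_for_async"      # suspicious: do not publish yet
    return "provisional_allow"       # publish now, full check runs async

def reconcile(full_verdict: str) -> str:
    """Async path: act on the full classification result."""
    return {"benign": "keep", "violation": "takedown"}.get(
        full_verdict, "queue_human"  # ambiguous verdicts go to a reviewer
    )
```

The key property: the sync decision is reversible, so the preflight can be tuned aggressively for latency without committing the system to its mistakes.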
118.5 Drill 2: LLM-as-judge and human review
LLM judge. After the classifier flags ambiguous content, a larger model (text LLM or VLM) scores it with a structured prompt like:
You are a content policy reviewer. Given the following post, classify it
against each of these policy classes: [hate_speech, harassment, self_harm,
violent_extremism, spam, other]. For each class, return:
- verdict: allow | remove | escalate
- confidence: 0-1
- reasoning: brief
Post: "..."
User context: [age, history, location]
The LLM returns structured JSON (enforced by guided decoding, Chapter 43). A judge call is ~500 input tokens + ~200 output tokens on an 8B-class model, ~2 s per call at batch size 1, ~200 ms amortized per call at high batch. At 200/sec peak judge rate, that's ~40 concurrent calls in flight — a few H100s with queueing, consistent with the async-tolerant fleet sizing in the estimates.
Why LLM judge instead of just bigger classifier? Two reasons. First, LLMs generalize better to policy edge cases that weren’t in training data. Second, the LLM produces reasoning, which is required for the audit trail. A human reviewer seeing “flagged for hate speech” is unactionable; a human seeing “flagged for hate speech because the phrase X is a slur targeting group Y” can act.
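Even with guided decoding, the consumer should validate the judge's output defensively — a malformed or prompt-injected response must fall through to escalation, never toauto-allow. A simplified sketch (single verdict rather than the per-class list in the prompt above; field names are assumptions):

```python
# Defensive parsing of the judge's structured output. Any schema violation
# degrades to "escalate" so failure modes err toward human review.
import json

VERDICTS = {"allow", "remove", "escalate"}

def parse_judge(raw: str) -> dict:
    try:
        out = json.loads(raw)
    except json.JSONDecodeError:
        return {"verdict": "escalate", "reason": "unparseable judge output"}
    v = out.get("verdict")
    c = out.get("confidence")
    if v not in VERDICTS or not isinstance(c, (int, float)) or not 0 <= c <= 1:
        return {"verdict": "escalate", "reason": "schema violation"}
    return out
```

Note the fail-direction: parsing failures cost reviewer time, not missed violations, which matches the asymmetric cost model from the clarifying questions.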
Human review. The most expensive stage. Reviewers are routed by:
- Content class specialization: CSAM reviewers are a different, highly-trained pool with extensive mental-health support. General reviewers handle spam and moderate content.
- Language: language-native reviewers are required for nuanced classes.
- Risk level: harder cases go to senior reviewers.
The review UI shows the content, the classifier verdict, the LLM judge reasoning, the user’s history, and a set of policy buttons. Target time per review: 15–30 seconds for routine, 2–3 minutes for difficult.
Reviewer health is a metric. Human reviewers exposed to graphic content burn out, and burnout leads to bad decisions. Metrics to watch: decisions per hour, inter-reviewer agreement, consecutive hours worked, reviewer-initiated breaks. Alerts fire on individual reviewer fatigue. This is policy-as-engineering at its purest: the humans in the loop are themselves a component that needs monitoring.
Inter-annotator agreement. A random sample of decisions is reviewed by a second reviewer. Agreement rate should be >85% for the system to be trustworthy. If it drops, the policy guidance needs clarification or the reviewers need retraining.
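Raw agreement is what the >85% target refers to, but because most content is benign, chance agreement is high; Cohen's kappa corrects for it and is the better drift signal. A small sketch of both, assuming paired decision lists from the double-review sample:

```python
# Raw agreement plus Cohen's kappa over a double-reviewed sample.
from collections import Counter

def agreement_and_kappa(a: list[str], b: list[str]) -> tuple[float, float]:
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each label's marginal frequencies.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return observed, kappa
```

A pool where 95% of decisions are "allow" can hit 90% raw agreement by chance alone, so tracking kappa per class catches drift that the raw number hides.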
118.6 Latency budgets, queueing, and the sync/async split
The sync budget is 300 ms. The classifier layer must fit in it. Everything past stage 2 is async.
Sync queue depth. If stage 1 and stage 2 together exceed their budgets, requests back up. A bounded queue with aggressive backpressure is required. The admission policy: if the queue depth exceeds 1000 items, new items are auto-allowed with async-only review (low-stakes fallback). This protects the user experience at the cost of accepting that some borderline content gets through in the worst case. For catastrophic classes, no fallback — those always wait.
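The admission policy above reduces to a small decision function. A sketch with illustrative names; the 1000-item limit is the figure from the text:

```python
# Admission control for the sync queue: over the depth limit, low-stakes items
# are auto-allowed with async-only review, but catastrophic-class candidates
# always wait. Names are illustrative.
SYNC_QUEUE_LIMIT = 1000
CATASTROPHIC = {"csam", "violent_extremism"}

def admit(queue_depth: int, candidate_classes: set[str]) -> str:
    if queue_depth <= SYNC_QUEUE_LIMIT:
        return "sync_review"
    if candidate_classes & CATASTROPHIC:
        return "sync_review"         # never shed catastrophic candidates
    return "allow_async_review"      # fail-open for low stakes under overload
```

This is the fail-open/fail-closed split from the tradeoffs list expressed as code: the overload behavior differs per class, not per service.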
Async queues. Kafka topics, one per severity level. LLM judge consumers pull from each topic with priority weighting. Human review queue fed from LLM judge topic, partitioned by content class and language.
Backpressure and load shedding (Chapter 77). When the LLM judge fleet is overloaded, the async side doesn’t fail — it queues. The queue can absorb bursts up to ~10 minutes of peak traffic. Beyond that, content is held longer than its SLA, which means delayed takedowns. Mitigation: scale the judge fleet aggressively on queue depth, and alert on SLA breaches.
Duration between post and decision, with traffic shares matching the cascade funnel:
| Path | Share of traffic | Latency |
|---|---|---|
| Sync allow after classifier | ~93% | <300 ms |
| Sync reject (CSAM hash hit) | ~0.01% | <300 ms |
| Async allow after LLM judge | ~5% | 1–5 minutes |
| Final decision after human review | ~1.3% | 5 min – 24 hours depending on severity |
The user experience: most content posts instantly. A small fraction is taken down later, with a notification to the user and an appeal link.
118.7 Appeals, audit trails, and the due-process layer
Why this matters. Moderation without appeals is censorship. Moderation with appeals is governance. Regulatory pressure (DSA in Europe, various US state laws) now mandates appeal processes. Senior interviewers test on this because candidates who skip it haven’t actually built one.
Audit trail. Every decision is logged to Kafka → object storage as an immutable event:
graph LR
DEC["Decision\n(allow/remove/warn)"] --> AUDIT["Audit Event\nKafka → Object Store\nimmutable append-only"]
AUDIT --> REPORT["Transparency Report\nquarterly · per-class counts"]
AUDIT --> RETRAIN["Classifier Retrain\nhuman labels → training data"]
AUDIT --> APPEAL["Appeals Path\nuser contests → re-review"]
APPEAL -->|"overturned"| RETRAIN
APPEAL --> FINAL2["Final Decision\n+ user notification"]
style AUDIT fill:var(--fig-accent-soft),stroke:var(--fig-accent)
The audit log is the foundation of the due-process layer — transparency reports, retraining pipelines, and regulator responses all read from the same immutable stream.
{
"decision_id": "...",
"content_id": "...",
"user_id": "...",
"action": "remove",
"stage": "human_review",
"reviewer_id": "...",
"classifier_scores": {...},
"llm_judge_reasoning": "...",
"policy_class": "hate_speech",
"timestamp": "...",
"policy_version": "2026.04"
}
Audit events are queryable by user (for transparency reports), by policy class (for model retraining), and by date (for regulator response).
Appeals path. A user contests a removal:
- Appeal submitted via the user-facing UI, stored in a separate appeals queue.
- Reviewed by a different human, with access to the full audit trail.
- Appeal decision is recorded. If overturned, the content is restored, the original decision is flagged for model retraining.
- User is notified of the final decision.
Appeals SLA. Target: 48 hours for review, 24 hours for urgent categories (accounts banned). Appeals that are overturned are a quality signal for the primary pipeline — a spike in overturn rate for a class signals policy drift or a model regression.
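The overturn-rate signal can be monitored per class with a simple aggregation over appeal outcomes. A sketch; the 10% threshold is the appeals-overturn SLO from the metrics section, and the event shape is illustrative:

```python
# Per-class overturn-rate monitor: a class whose overturn rate exceeds the
# SLO signals policy drift or a model regression in the primary pipeline.
def overturn_alerts(appeals: list[dict], slo: float = 0.10) -> set[str]:
    """appeals: [{"policy_class": str, "overturned": bool}, ...]"""
    totals: dict[str, int] = {}
    overturned: dict[str, int] = {}
    for a in appeals:
        c = a["policy_class"]
        totals[c] = totals.get(c, 0) + 1
        overturned[c] = overturned.get(c, 0) + a["overturned"]
    return {c for c in totals if overturned[c] / totals[c] > slo}
```

In production this would run windowed over the audit stream; the point is that appeals are not just due process, they are a labeled evaluation set the pipeline generates for free.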
Policy versioning. Every decision carries the policy version it was made under. When policy changes, old decisions are not retroactively applied (except for CSAM and related, which are universal). This is legally and operationally important.
118.8 Metrics, SLOs, and the two-dimensional confusion matrix
The naive metric is accuracy. The senior metric is the confusion matrix, decomposed by content class.
Per-class metrics:
| Metric | What it measures | SLO |
|---|---|---|
| Recall (true positive rate) | Of all bad content, what fraction did we catch? | varies by class |
| Precision | Of all flagged content, what fraction was actually bad? | varies |
| False negative rate | Slipping bad content through | 0% for CSAM; <5% for severe |
| False positive rate | Wrongly flagging benign content | <2% for low stakes |
| Appeals overturn rate | Fraction of appeals the system gets wrong | <10% |
| Time-to-decision p95 | From post to final action | class-dependent |
| Human review fatigue signal | Per-reviewer health | internal |
The platform-wide SLOs:
- Catastrophic content time-to-removal: p99 < 5 minutes.
- User-facing post latency: p95 < 1 second.
- Human review SLA compliance: 95%.
- Appeals review SLA compliance: 95%.
- Transparency report accuracy: quarterly, audited.
Online metrics from user behavior. Did the user delete the content themselves after receiving a warning? Did they appeal? Did they reoffend? These feed back into user risk scores and the training data for classifiers.
118.9 Failure modes and rollout
Top failure modes:
- Classifier regression after retrain. A new model version drops recall on a critical class. Mitigation: mandatory golden-set eval with zero-regression-on-CSAM policy before any rollout. Canary at 1% of traffic with shadow eval.
- LLM judge goes off-policy. The LLM starts generating unsafe outputs because of prompt injection inside user content. Mitigation: strict input sanitization, structured output parsing, never let the judge’s output influence user-facing text.
- Reviewer labeling drift. Reviewers in one region interpret a policy differently than another, causing inconsistency. Mitigation: cross-region calibration meetings, shared gold set, periodic re-training.
- Appeals backlog. A spike in removals causes a spike in appeals, which exceeds reviewer capacity. Mitigation: reviewer surge capacity, SLA degradation alarms, automatic reversion of low-stakes removals if the appeals backlog exceeds 3 days.
- Hash DB update lag. A new CSAM hash is added to NCMEC’s DB but not yet fetched. Mitigation: hourly refresh, fail-closed on DB fetch errors.
- Adversarial attack campaign. A coordinated campaign floods the system with adversarial variants of banned content. Mitigation: anomaly detection on volume by user/IP/content fingerprint; automated rate-limiting and escalation to a human incident team.
Rollout. New models roll out in three steps: shadow traffic (model runs but doesn’t influence decisions, outputs compared to existing), canary (1% traffic), full. Zero-tolerance rollback on any regression in catastrophic-class metrics. Lower-stakes regressions (spam precision dropped 2%) are tolerated if the overall F1 improved.
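The promotion gate reduces to a comparison of golden-set metrics between the production and candidate models. A simplified sketch (a fixed per-class recall tolerance stands in for the "overall F1 improved" rule; names are illustrative):

```python
# Rollout gate sketch: zero regression on catastrophic classes, bounded
# regression elsewhere. prod/cand map class -> recall on the shared golden set.
CATASTROPHIC = {"csam", "violent_extremism"}

def promote(prod: dict, cand: dict, tolerance: float = 0.02) -> bool:
    for cls, prod_recall in prod.items():
        drop = prod_recall - cand.get(cls, 0.0)
        if cls in CATASTROPHIC and drop > 0:
            return False             # any catastrophic regression blocks rollout
        if drop > tolerance:
            return False             # bounded regression for other classes
    return True
```

The asymmetry is deliberate: a candidate that improves spam F1 but loses a fraction of a point of CSAM recall never ships, no matter how good the aggregate numbers look.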
118.10 Tradeoffs volunteered
- Sync vs async per class: sync for CSAM hash hit, async for everything else for latency.
- Classifier ensemble vs single model: ensemble for CSAM; single for spam.
- DistilBERT vs larger classifier: DistilBERT for throughput; larger model if recall target can’t be hit.
- LLM judge vs specialized classifier: judge for nuance and reasoning; classifier for throughput.
- Fail-open vs fail-closed: fail-open on low-stakes when services are down; fail-closed on catastrophic.
- Per-user risk score vs stateless: per-user for repeat offenders; stateless is simpler but misses patterns.
- In-house reviewers vs outsourced: in-house for high-stakes and appeals; outsourced for volume.
- Hash DB vs classifier for CSAM: hash DB is primary (legally mandated cooperation with NCMEC); classifier is secondary for novel content.
118.11 The mental model
Eight points to take into Chapter 119:
- Cost asymmetry drives architecture. False positives and false negatives have different costs, and the cost function varies by content class.
- The cascade funnels traffic by orders of magnitude. Rules → classifier → LLM judge → human: 3000/sec in, ~40/sec to human review.
- Each stage is an order of magnitude or more slower and more expensive per item than the previous, and sees a fraction of its volume. Design the funnel to match.
- Sync vs async split is per-class. CSAM hash is always sync; video full classification is always async.
- Calibration is the hidden requirement. Without calibrated probabilities, the cascade thresholds are meaningless.
- Appeals and audit trails are not optional. Regulators test for them; senior interviewers test for them.
- Human reviewers are a monitored component. Fatigue, inter-annotator agreement, and burnout are first-class metrics.
- Rollouts are shadow + canary with zero-regression-on-catastrophic gates. No exceptions.
Chapter 119 moves to a very different design question: real-time recommendations, where the ML is classical ranking models, not LLMs, and the interesting parts are the feature store and the retrieval/ranking split.
Read it yourself
- Platform engineering blogs on ML-based content moderation at scale — Meta, YouTube, and TikTok all have public posts on their moderation pipelines; read two or three.
- Inan et al., Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations — for the LLM-as-guardrail pattern.
- The NCMEC CyberTipline documentation and PhotoDNA overview.
- The EU Digital Services Act (DSA) summary sections on due process and transparency.
- Sarah T. Roberts, Behind the Screen: Content Moderation in the Shadows of Social Media — the human reviewer experience.
- AWS Bedrock Guardrails and Azure Content Safety API documentation for managed-service patterns.
Practice
- Re-estimate the human reviewer count if the rate of flagged content doubles because a new adversarial campaign floods the system.
- Design the data pipeline that produces the training set for next month’s classifier retrain, including how human review decisions feed back in.
- The regulator requires a transparency report: for each content class, how many items were removed, how many were appealed, how many appeals were overturned. Design the data path to produce this monthly.
- A new content class arrives (e.g., deepfake impersonation). Walk through the full rollout: training data, classifier training, eval, deployment, reviewer training.
- A video arrives with adversarial perturbations that fool the image classifier. Design the robustness plan.
- The sync budget is cut from 300 ms to 100 ms. What changes in the cascade? Which stages must be re-optimized?
- Stretch: Design the same pipeline for a small platform (10k posts/day) where human review is cheaper than GPUs. How does the cascade invert? Which stages disappear, and which become the bottleneck?