Part I · ML Foundations
Chapter 10

The model card: reading model lineage adversarially

"Model cards are marketing documents written by the people who want you to use the model."

This is the closing chapter of Part I, and it’s the most practical one. Every new model release ships with a model card: a markdown file describing the architecture, training data, intended use, license, and evaluation results. The card is the first thing you’ll see when deciding whether to deploy a model. It is also the place where the model author has the strongest incentive to put the model in the best light, leave out the inconvenient parts, and emphasize the cherry-picked numbers.

This chapter is about how to read a model card the way a senior engineer reads it: adversarially, with a checklist of things to look for, things to discount, and things that are conspicuously missing. By the end you will be able to walk into a model card cold and, in five minutes, predict serving cost, license risk, training-data quirks, and likely production failure modes.

Outline:

  1. What a model card actually contains (in the best case).
  2. Architecture and parameter count — what to extract.
  3. Context length: advertised vs effective.
  4. Training data — what they tell you and what they don’t.
  5. License — the four categories you need to know.
  6. The lineage graph: base → SFT → DPO → quantized.
  7. How to spot the base model under a chat veneer.
  8. Quantization variants and what they cost in quality.
  9. Eval claims and how to discount them.
  10. The five-minute model card walkthrough.
  11. The non-obvious failure modes you only learn from deploying.

10.1 What a model card actually contains

The “model card” idea was formalized by Mitchell et al. (Model Cards for Model Reporting, 2018) at Google as an attempt to standardize the documentation of ML models with respect to intended use, performance characteristics, and limitations. The HuggingFace community adopted the term for the README that ships with every model on the Hub.

In practice, a modern open LLM model card has some subset of the following:

  • Model type and architecture. “Causal decoder-only transformer,” “32 layers, 32 attention heads, 4096 hidden dim,” etc.
  • Parameter count. Often the headline number.
  • Tokenizer and vocab size.
  • Training data. Usually vague.
  • Training compute and methodology. Pretraining, SFT, DPO, RLHF — which stages, at what scale.
  • Context length.
  • License.
  • Intended use and limitations.
  • Evaluation results. Tables of benchmark scores.
  • Bias and safety statements. Usually boilerplate.
  • Citation. BibTeX for the paper.
  • Quick-start code snippet.

The good cards (Llama 3, DeepSeek-V3, Qwen 2.5) are detailed and honest about their limitations. The bad cards (some startup releases, some hobbyist fine-tunes) are essentially marketing copy with a few benchmark numbers. The skill is in extracting signal from both.

10.2 Architecture and parameter count — what to extract

When you open a model card, the first numbers to find:

  • Total parameters. This sets the baseline for memory and compute. A 7B model needs ~14 GB of VRAM in bf16; a 70B needs ~140 GB; a 405B needs ~810 GB. These are minimum-to-load-the-weights numbers, not training or full inference numbers. Cross-check Chapter 4’s memory math.
  • Active parameters (for MoE models). For a Mixture-of-Experts model like DeepSeek-V3 (671B total, 37B active), the active count is what determines per-token compute. The total count determines storage. The two are wildly different and the difference matters for everything from serving cost to interview questions.
  • Number of layers, heads, hidden dim, FFN dim. These let you compute KV cache size per token (Chapter 22) and roughly predict throughput on a given GPU.
  • Attention variant. MHA, MQA, GQA, MLA. This matters enormously for KV cache size. A model with GQA and 8 KV heads will have an 8× smaller KV cache than the same model with full MHA. We cover this in Chapter 33; for now, just look for it in the card.
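The first bullets reduce to arithmetic you can do on the spot. A minimal sketch, using the published Llama-3.1-8B shape (32 layers, 8 KV heads under GQA, head dim 128) as the worked example; treat these as rough estimators, not allocator-exact numbers:

```python
def weights_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Memory just to load the weights (bf16 = 2 bytes/param, int4 ~ 0.5)."""
    return params_billions * bytes_per_param

def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int,
                             head_dim: int, bytes_per_elem: int = 2) -> int:
    """Per-token KV cache: a K and a V tensor for every layer, bf16 elements."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Llama-3.1-8B: 32 layers, 8 KV heads (GQA), head_dim 128
print(weights_gb(8))                                # 16.0 GB in bf16
print(kv_cache_bytes_per_token(32, 8, 128) / 1024)  # 128.0 KB per token
```

With full MHA (32 KV heads instead of 8) the same arithmetic gives 512 KB per token, which is exactly the 8× KV-cache difference the GQA bullet warns about.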

Specific things to look up if not stated:

  • For MoE models: how many experts, how many are activated per token (top-k routing), is there a shared expert?
  • For long-context models: what RoPE base frequency, what max position?
  • For tied vs untied embeddings: does the LM head share weights with the input embedding?

If the card doesn’t say, the model’s config.json on HuggingFace usually does. Read it.
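A sketch of what reading config.json looks like, with the network part factored out: the dict below hand-copies field values from the published Llama-3.1-8B config (in a real workflow you would download the file itself, e.g. with huggingface_hub's hf_hub_download):

```python
import json

# Llama-3.1-8B-shaped config.json, fields abbreviated to the ones we need.
config = json.loads("""{
  "architectures": ["LlamaForCausalLM"],
  "num_hidden_layers": 32,
  "num_attention_heads": 32,
  "num_key_value_heads": 8,
  "hidden_size": 4096,
  "intermediate_size": 14336,
  "vocab_size": 128256,
  "max_position_embeddings": 131072,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false
}""")

# The fields the card often omits but config.json states outright:
attn = "GQA" if config["num_key_value_heads"] < config["num_attention_heads"] else "MHA"
print(attn)                               # attention variant
print(config["rope_theta"])               # RoPE base frequency
print(config["max_position_embeddings"])  # trained max position
print(config["tie_word_embeddings"])      # tied vs untied LM head
```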

10.3 Context length — advertised vs effective

This is the single most over-claimed property of modern LLMs.

A model card might say “supports 1M tokens of context.” This is, in almost every case, the maximum position the model was trained on, or the maximum position it can accept without crashing. It is not the maximum context at which the model produces high-quality answers.

[Figure: advertised vs effective context length. Retrieval quality (needle-style) degrades past one quarter to one half of the advertised length; reasoning quality (RULER-style) degrades even faster, with effective context around 32k for a 128k claim.]
Reasoning quality collapses well before the advertised context limit — a model claiming 128k context reliably handles multi-hop reasoning only to about 32k, a quarter of what the card says.

The discrepancy is large. There’s a famous benchmark called needle-in-a-haystack that places a piece of information (the needle) at a random position in a long context (the haystack) and asks the model to retrieve it. Most “long context” models do well with the needle near the start or near the end and badly with the needle in the middle — this is the lost in the middle phenomenon (Liu et al., 2023).

A more demanding benchmark is RULER (Hsieh et al., 2024), which tests whether a model can do multi-hop reasoning over the long context, not just single-fact retrieval. RULER scores collapse much faster than needle-in-a-haystack scores. A model that does well to 128k on needle-in-a-haystack might do well only to 32k on RULER.

The practical rule: the effective context is one quarter to one half of the advertised context for retrieval-style tasks, and even less for reasoning-style tasks. If a card claims 1M context, plan around 256k. If it claims 128k, plan around 32–64k. If it claims 32k, you can probably trust it.
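The practical rule as a tiny planning heuristic. The one-quarter retrieval fraction comes straight from the rule above; the one-eighth reasoning fraction is an assumed, stricter budget for multi-hop tasks, which you should tune against your own long-context evals:

```python
def plan_context(advertised_tokens: int) -> dict:
    """Discount an advertised context window to realistic planning budgets."""
    return {
        "advertised": advertised_tokens,
        "retrieval_budget": advertised_tokens // 4,  # needle-style tasks
        "reasoning_budget": advertised_tokens // 8,  # multi-hop: assume stricter
    }

print(plan_context(1_000_000))  # a "1M" claim: plan around 250k for retrieval
print(plan_context(131_072))    # a "128k" claim: plan around 32k for retrieval
```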

We will cover the technical reasons for this (RoPE extrapolation, attention bandwidth limits) in Chapter 35.

10.4 Training data — what they tell you, and what they don’t

The training data section of a model card is almost always vague. The reasons are partly legal (copyright lawsuits) and partly competitive (data is the moat).

Things you might see:

  • “Trained on a mix of publicly available web data, books, and code.”
  • “Approximately 15 trillion tokens.”
  • “Filtered for quality and safety.”

Things you almost never see, but that matter enormously:

  • The actual sources, by name.
  • The deduplication strategy.
  • The quality filtering thresholds.
  • The language mix.
  • The proportion of synthetic vs human-generated data.
  • Whether eval datasets were excluded from training (this is the contamination question, and it’s the main reason benchmark numbers should be discounted).

When you can’t get the data details from the card, read the paper, if there is one. Llama 3, Qwen 2.5, DeepSeek-V3 all have technical reports that go far beyond what’s in the model card. The paper is also where you’ll find the data ablations — “we tried with X and without X and the difference was Y.”

Two things to check specifically:

(1) Language mix. A model trained on 95% English will produce English-language output even when you prompt it in another language, and its non-English benchmark numbers will be unreliable. If you’re deploying in a multilingual setting, check the language distribution. Llama 3 was 95% English; Qwen 2.5 was much more multilingual; Aya is explicitly multilingual.

(2) Code mix. If you’re going to use the model for code generation, check whether the training mix includes code. A “general” LLM trained on 5% code will be much worse at code than a “code” LLM trained on 50% code. The boundary between general and code models is fuzzy, but the underlying training-data mix is what determines it.

10.5 License — the four categories

This is the place where production engineers get burned the hardest. Read the license. Then read it again. Then read the community license addendum, if there is one.

The four categories you’ll meet:

Apache 2.0 / MIT / BSD-3

The “permissive” open licenses. You can use the model for anything — commercial, research, derivative works, redistribution — as long as you preserve the copyright notice. Examples: Mistral 7B (the original release), Qwen 2.5 (most sizes), DeepSeek, Phi, and most open embedders. (Gemma has its own license, covered below.)

Safe for production. The model is yours.

Llama Community License

Meta’s custom license for the Llama family. Not an OSI-approved open license, despite being publicly available. Key points:

  • Free to use including commercially for most users.
  • But: if you have more than 700 million monthly active users at the time of the release, you need a special license from Meta. This excludes the largest tech companies.
  • Requires attribution (“Built with Llama”).
  • Restricts using outputs to improve other models: Llama outputs cannot be used to train a competing LLM.

For 99% of companies, the Llama license is fine, but the “no training competitor models” clause has bitten people. Always check whether your fine-tunes or distillations might be considered “competing models.”

Gemma License

Google’s Gemma family has its own license, similar to Llama but with subtly different terms. Allows commercial use, requires attribution, restricts certain use cases.

Research-only / non-commercial

Some models are released for research purposes only. The classic example is the original LLaMA 1 (which leaked anyway). Others include some academic releases, some safety-research models, and a few specific frontier models. Cannot be used in commercial products, full stop. Even for evaluation, the license terms might restrict serving them in any production-like setting.

Proprietary / API-only

GPT-4, Claude, Gemini Pro — these aren’t downloadable at all. You access them via API, and the terms of use are part of the API agreement. License questions become questions about data flowing through the API, retention, training-on-customer-data clauses, etc.

The discipline: before you get attached to a model in evaluation, check the license. Many engineers have wasted weeks tuning a model that turned out to be non-deployable for license reasons. The license is the first filter, not the last.

10.6 The lineage graph: base → SFT → DPO → quantized

Modern open LLMs come in families. A typical lineage:

Llama-3.1-8B (base, pretrained)
    ├── Llama-3.1-8B-Instruct (SFT + DPO, the chat-tuned version)
    │       ├── Llama-3.1-8B-Instruct-AWQ (4-bit AWQ quantized)
    │       ├── Llama-3.1-8B-Instruct-GPTQ (4-bit GPTQ quantized)
    │       └── Llama-3.1-8B-Instruct-FP8 (8-bit FP8 quantized)
    ├── Llama-3.1-8B-CodeAlpaca (third-party fine-tune)
    └── Llama-3.1-8B-Guard (safety-tuned variant)

Reading a model card requires placing the model in its lineage.

[Figure: the model lineage graph. The base pretrained model forks into the instruct variant, then quantized derivatives and third-party fine-tunes; the base license propagates to all derivatives.]
Every node in the lineage inherits the base model's license — a community fine-tune labelled "MIT" that sits on Llama weights is still bound by Meta's community license, regardless of what the card says.

The questions:
  • Is this a base model or an instruct model? A base model is the raw output of pretraining — it does next-token prediction but doesn’t know how to follow instructions. You almost always want the instruct variant for production. The base is for further fine-tuning.
  • What chat template does it use? Different points in the lineage use different templates. Always use the tokenizer’s apply_chat_template (Chapter 5).
  • Was it instruction-tuned, RLHF’d, DPO’d, or all three? This affects how steerable it is and what kind of system prompts work.
  • Is this quantized? At what precision? Using what scheme?
  • Is this a fine-tune of a base model from a different family? Watch for model cards that say “based on Llama” but ship with a different tokenizer or chat template — those are usually heavy fine-tunes that may behave very differently.

The lineage is also where most “this model is really good at X” claims come from. A specialist fine-tune of Llama-3.1-70B on Korean medical text will, unsurprisingly, be very good at Korean medical questions. That doesn’t mean the underlying Llama-3.1-70B has those capabilities; it means the fine-tuner injected them. Be careful about generalizing.

10.7 How to spot the base model under a chat veneer

Sometimes a model card says “our brand new amazing chat model” without telling you what base it was built on. The base model determines almost everything important: architecture, tokenizer, vocab, license, original capabilities. If the card doesn’t tell you, you have to detect it.

Three ways to do it:

(1) Look at config.json. The HuggingFace config.json for a fine-tune almost always inherits architecture details from the base. The number of layers, heads, hidden dim, vocab size, and (sometimes) the model type field are dead giveaways. A model with num_hidden_layers=80, hidden_size=8192, num_attention_heads=64 and a Llama-shaped config is a Llama-3.1-70B fine-tune, no matter what the card calls it.

(2) Look at the tokenizer. Run tokenizer.encode("Hello, world!") and compare the token IDs to known base tokenizers. If they match Llama 3’s tokenizer, it’s a Llama 3 fine-tune. If they match Mistral’s, it’s a Mistral fine-tune. If they match Qwen’s, it’s a Qwen fine-tune.

(3) Probe with a known base prompt. Base models complete text in characteristic ways. A Llama base will often complete "The capital of France is" with a confident " Paris." followed by random Wikipedia-style text. A Qwen base will do something similar but with a slightly different cadence. After enough exposure you can identify base models by feel.

Why does this matter? Because the base determines the inherited license. A model released under “MIT” by a third party is still bound by the base model’s license if it’s a fine-tune. A “MIT-licensed Llama fine-tune” is legally questionable — the fine-tuner can’t relicense Meta’s weights. Always trace to the actual base before making license decisions.

10.8 Quantization variants — what they cost in quality

Most popular models ship in multiple quantization variants. The card might say “available in fp16, fp8, int8, int4 (AWQ), int4 (GPTQ).” Each variant has different memory and quality characteristics.

The rough quality story (which we’ll cover in detail in Chapter 26):

Precision                | Memory vs fp16 | Quality cost      | Notes
fp16 / bf16              | 1×             | none (baseline)   | Standard for serving.
FP8 (e4m3)               | 0.5×           | very small        | H100/H200 hardware support; calibration matters.
INT8                     | 0.5×           | small             | Older but still common.
INT4 (AWQ)               | 0.25×          | small to moderate | The current best 4-bit scheme for most models.
INT4 (GPTQ)              | 0.25×          | small to moderate | Similar to AWQ; varies by model.
INT4 (bitsandbytes nf4)  | 0.25×          | moderate          | Used in QLoRA training; not always best for serving.

The honest version of the quality cost is that smaller models lose more. A 70B model quantized to INT4 might score 1–2 points lower than the bf16 version on standard benchmarks. A 7B model quantized to INT4 might lose 5–10 points. The reason: smaller models have less redundancy, so each individual weight matters more.

The card might claim “no quality loss from quantization.” Discount this claim. There is always some loss; the question is whether it’s small enough to matter for your use case. The right way to evaluate is to run the quantized version on your own evaluation set (Chapter 20) and compare to the bf16 baseline. Standard benchmarks miss the regressions that matter most for production.
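A minimal sketch of that comparison: score each variant on your own eval set and look at the delta, not the card's claim. The scoring function and the example outputs here are hypothetical stand-ins; in practice the predictions would come from your serving stack running each variant:

```python
def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of exact matches; swap in your real scoring function."""
    assert len(predictions) == len(references)
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)

refs       = ["42", "Paris", "yes", "blue"]
bf16_preds = ["42", "Paris", "yes", "red"]   # hypothetical bf16 outputs
int4_preds = ["42", "Paris", "no",  "red"]   # hypothetical int4 outputs

delta = accuracy(bf16_preds, refs) - accuracy(int4_preds, refs)
print(f"quantization cost: {delta:.2%} absolute accuracy")
```

The point of the paired layout is that both variants see identical inputs, so the delta isolates the quantization effect rather than prompt or sampling noise.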

10.9 Eval claims and how to discount them

The benchmark table is the most heavily cherry-picked part of any model card. To read it skeptically:

[Figure: card-claimed scores (MMLU 82, GSM8K 91) are an upper bound, to be discounted for contamination, 5-shot vs 0-shot settings, custom harnesses, cherry-picked tasks, and missing domains; real-world performance on your own eval set sits below them.]
Benchmark scores are the upper bound — contamination, prompt format differences, and distribution shift all pull real-world performance lower, which is why the five-minute card walkthrough ends with "should I evaluate on my own data," not "should I deploy."

(1) Compare apples to apples. A model card claiming “65 on MMLU” might be reporting 5-shot, while the model it’s compared against was tested at 0-shot. The numbers are not comparable. Always check the few-shot count, the sampling settings, and the evaluation framework (lm-eval-harness vs the model author’s custom harness vs HELM).

(2) Watch for “vs older versions.” A new model that beats a 6-month-old model on benchmarks is a low bar. Compare to the current state of the art at the same parameter count.

(3) Beware of contamination. If the model authors trained on the test set (deliberately or accidentally), the score is meaningless. Some benchmarks (MMLU, ARC) are heavily contaminated; some newer ones (GPQA, MMLU-Pro, Arena-Hard) are harder to game. Recent strong scores on the harder benchmarks are more believable.

(4) Watch for missing benchmarks. If a model card shows 8 benchmarks but is conspicuously missing the one that’s most important for your use case, that’s a signal. Either the authors didn’t run it (unlikely for a major release) or the score was bad.

(5) Watch for “average.” A model with a high average across 12 benchmarks but a low score on the one you care about is worse than a model with a lower average but a high score on yours.

(6) Look for arena and human eval. Benchmarks like LMSYS Arena (now LM Arena) test models against each other in human pairwise comparisons. The arena leaderboard is one of the most contamination-resistant signals available, because the test prompts are generated by real users, change continuously, and the comparisons are blind.

We’ll cover evaluation in much more depth in Chapter 20. For now, the rule is: the benchmark numbers in the card are the upper bound on the model’s performance. Real-world performance is always lower.

10.10 The five-minute model card walkthrough

The discipline. Open a model card. Set a timer. Five minutes.

Minute 1 — Architecture and size. Find the parameter count, the number of layers, heads, hidden dim, and the attention variant. Compute the bf16 memory footprint. Decide if you can serve it on the hardware you have.

Minute 2 — Lineage and chat template. Find the base model. Find the fine-tuning stages (SFT, DPO, RLHF). Find the chat template. Decide if the lineage is clean enough that you trust it.

Minute 3 — License. Read the license. Identify which of the four categories it falls in. Check for the “competing models” clause and any usage restrictions. Decide if it’s safe for your use case.

Minute 4 — Training data and language mix. Find the data section. Read it skeptically. If it’s missing, find the paper. Identify the language mix and the code proportion. Decide if it’s a fit for your domain.

Minute 5 — Eval scores. Skim the benchmark table. Check the few-shot and harness settings. Note any benchmark that’s conspicuously missing. Look at the LMSYS Arena leaderboard for a contamination-resistant comparison.

After five minutes you should have a confident answer to: “Should I evaluate this model on my own data, or skip it?” Most models you should skip. The few that pass the five-minute filter get the full evaluation treatment.

10.11 The non-obvious failure modes

The things that aren’t on any model card and that you only learn from deploying:

  • The repetition cliff. Some models are fine for short outputs and turn into broken records past 1000 tokens. Test long-output generation explicitly.
  • The system-prompt-ignored failure. Some instruction-tuned models follow the system prompt diligently for a few turns and then drift. Test multi-turn conversations.
  • The chat template sensitivity. A model trained with a specific chat template might be very sensitive to whitespace or token differences. We covered this in Chapter 5.
  • The math regression. A model that scores well on MMLU but can’t add three-digit numbers reliably is common. Always test arithmetic separately.
  • The refusal calibration. A model trained with heavy safety RLHF might refuse benign requests. Test the refusal rate on your actual prompts.
  • The leading-whitespace bug. Some models reliably add a leading space or newline before every response. This is a tokenizer quirk. Strip it in your serving layer.
  • The streaming-vs-final disagreement. A model that produces high-quality final outputs at temperature 0 might produce low-quality intermediate streams when sampled at temperature 0.7. Test both modes.
  • The chat-template injection vulnerability. Already mentioned in Chapter 5. If user inputs can contain raw chat-template tokens, they can break out of their role. Sanitize.

These are not in the model card, they will not be in any blog post, and you will not learn them from benchmark numbers. You learn them by deploying the model and watching what happens. The senior-engineer move is to expect them — to set up a small adversarial eval suite that probes for each of these patterns before you commit to a model.
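One probe from such a suite, for the repetition cliff: measure how much of a long output is taken up by its single most common n-gram. The threshold you alert on is an assumption to tune against your own traffic, not a standard:

```python
from collections import Counter

def repetition_score(text: str, n: int = 4) -> float:
    """Fraction of all word n-grams accounted for by the most common one."""
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return Counter(ngrams).most_common(1)[0][1] / len(ngrams)

healthy = "the model card lists architecture license data and eval results"
broken  = "I am sorry I am sorry I am sorry I am sorry I am sorry"

print(repetition_score(healthy, n=3))  # low: no dominant 3-gram
print(repetition_score(broken,  n=3))  # high: one 3-gram dominates
```

Run it over generations past the 1000-token mark specifically; a model can look healthy on short outputs and only fall off the cliff deep into a long one.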

10.12 The Part I capstone

This is the last chapter of Part I. You now have the full foundation:

  • Chapter 1: the data structure (tensors, shapes, broadcasting).
  • Chapter 2: the function (forward pass, MLPs).
  • Chapter 3: the gradient (autograd, backprop).
  • Chapter 4: the optimizer (loss, AdamW, schedules).
  • Chapter 5: the input (tokens, BPE, chat templates).
  • Chapter 6: the operation (attention from first principles).
  • Chapter 7: the architecture (transformer blocks, RMSNorm, RoPE).
  • Chapter 8: the loop (autoregressive decoding, sampling).
  • Chapter 9: the encoder side (embeddings, rerankers, contrastive learning).
  • Chapter 10 (this chapter): the practical skill (reading model artifacts adversarially).

Together, these ten chapters are the prerequisites for everything that follows. You should be able to:

  • Write a transformer block from memory.
  • Explain backprop without notes.
  • Defend AdamW + warmup + cosine + grad clip in code review.
  • Diagnose a tokenizer bug in five minutes.
  • Identify pre-norm vs post-norm and explain why pre-norm wins.
  • Explain why prefill and decode are different workloads (prepping you for Part III).
  • Tell a bi-encoder from a cross-encoder and pick the right tool.
  • Read a model card in five minutes and decide whether to evaluate.

If any of these still feels shaky, go back to the relevant chapter and re-read. The next part — Training, Fine-Tuning, Alignment — assumes you have all of these on tap.

10.13 The mental model

Eight points to take into Part II:

  1. A model card is a marketing document. Read it with a checklist.
  2. Architecture and parameter count set memory and compute baselines. Find them first.
  3. Context length is overstated. Effective is one-quarter to one-half of advertised.
  4. License is the first filter. Llama is not MIT; Llama fine-tunes are not MIT either.
  5. The lineage graph matters. Trace the base before evaluating the fine-tune.
  6. Quantization always costs quality. Smaller models lose more.
  7. Eval scores are the upper bound. Real-world performance is lower.
  8. The non-obvious failure modes are the ones that matter. Set up an adversarial eval before committing.

In Part II we go up the ladder: how the models you’re now equipped to read were actually built.


Read it yourself

  • Mitchell et al., Model Cards for Model Reporting (2018) — the original Google paper that defined the term.
  • The Llama 3 model card on HuggingFace (meta-llama/Llama-3.1-8B-Instruct) — read it cover to cover. It is one of the better cards.
  • The Llama 3 paper (Grattafiori et al., 2024). Compare what’s in the paper to what’s in the model card. The gap is what model cards typically miss.
  • The Llama 3 Acceptable Use Policy and Community License Agreement — the actual legal documents. Read both.
  • The LMSYS Arena leaderboard — the contamination-resistant benchmark. Bookmark it.
  • Liu et al., Lost in the Middle: How Language Models Use Long Contexts (2023) — the canonical paper on the effective-vs-advertised context discrepancy.
  • The RULER benchmark (Hsieh et al., 2024) — the harder long-context benchmark.

Practice

  1. Open the model card for meta-llama/Llama-3.1-70B-Instruct. Run the five-minute walkthrough on it. Write down your findings.
  2. Compute the bf16 memory footprint, the int4-quantized memory footprint, and the per-token KV cache size for Llama 3.1 70B. Decide what hardware you’d need to serve it.
  3. Find a chat-tuned model on HuggingFace that doesn’t tell you what its base is. Use the techniques in §10.7 to identify the base model.
  4. Pick three open LLM model cards from different organizations. Compare their honesty about training data. Which one is most informative? Which is least?
  5. Read the Llama community license. Identify three concrete restrictions a startup would need to be aware of.
  6. Find a model on HuggingFace that claims a 1M context window. Test it on a long-context retrieval task at different needle positions and report how the quality degrades.
  7. Stretch: Build a small “model card analyzer” script that takes a HuggingFace model ID and prints (a) parameter count, (b) bf16 memory, (c) chat template, (d) license, (e) tokenizer family. The script is just a wrapper around the HF Hub API; the value is in deciding what to extract.

Concept check

4 questions.
  1. A model card shows strong benchmark scores on MMLU, HumanEval, and MT-Bench. The most important adversarial question to ask about these numbers is
  2. A model is described as derived from a base with the lineage 'base → SFT → DPO → GGUF Q4_K_M.' What does Q4_K_M tell you about serving cost vs the full-precision model?
  3. A model card advertises a 128k context window. The adversarial interpretation is
  4. The 'base model' under a chat fine-tune is important to identify because