Part I · ML Foundations
Chapter 5 Core ~20 min read

Tokens, vocabularies, and the tokenizer is the bug

"Half of all 'why is the model behaving weirdly' tickets are tokenizer tickets. The other half become tokenizer tickets after twenty minutes of debugging"

A neural network operates on tensors of numbers. Text is not a tensor of numbers. Something has to translate between them, and that something is the tokenizer. The tokenizer is the most boring component in any LLM pipeline and it is also the source of more silent bugs than any other layer. This chapter is about why.

We will cover:

  1. The choice of unit: characters, words, or something in between.
  2. Byte-Pair Encoding (BPE) — the algorithm that won.
  3. SentencePiece, WordPiece, Unigram, tiktoken — the variants you’ll meet.
  4. Vocab size and the cost-quality tradeoff.
  5. Special tokens: BOS, EOS, PAD, MASK, CLS, SEP, and the chat template tokens.
  6. Chat templates and the silent prompting disaster.
  7. Multilingual cost asymmetry.
  8. Tokenizer drift across versions.
  9. Production gotchas.

By the end you will know why “the model can’t count letters in a word” is a tokenizer fact, not a model fact, and why retraining a tokenizer is one of the hardest things in NLP.


5.1 The unit-of-text problem

When you give an LLM the string "Hello, world!", it does not see characters. It does not see words. It sees a sequence of integers — token IDs. Something has to chop the string into pieces and assign each piece an integer.

The three obvious choices for the chopping unit are:

(1) Characters. Each character is a token. Pros: tiny vocab (~256 for ASCII, a few thousand for Unicode), no out-of-vocabulary problem. Cons: sequences are very long, the model has to learn the concept of “word” from scratch, and computational cost scales with sequence length, so character-level models are expensive to train and slow to serve.

(2) Words. Each word is a token. Pros: short sequences, semantically natural. Cons: the vocab has to be enormous to cover even one language well (English has ~600k unique forms once you count inflections), the long tail is unbounded (any new word is unknown), and morphology is opaque (the model never sees that running and runs share a stem). Out-of-vocabulary words have to be replaced with <UNK>, which silently destroys information.

(3) Subwords. Compromise: chop into pieces that are usually shorter than words but longer than characters. Common stems and prefixes get their own tokens; rare words get split into multiple subword pieces. Pros: bounded vocab (typically 30k–200k), no out-of-vocabulary problem (in the limit, fall back to bytes), reasonable sequence length, sees morphology. Cons: not aligned with human word boundaries, so introspecting “what character did the model see” requires careful unwinding.

Subwords won. Every modern LLM uses a subword tokenizer of some flavor. The differences between flavors matter, but the choice “subword over word or character” is settled.

5.2 Byte-Pair Encoding (BPE) — the algorithm that won

Byte-Pair Encoding was originally a 1994 data-compression algorithm. Sennrich, Haddow & Birch repurposed it for NLP in 2016 in Neural Machine Translation of Rare Words with Subword Units. It has been the foundation of essentially every major LLM tokenizer since.

The training algorithm is this simple:

BPE training loop: start with characters, find most frequent adjacent pair, merge, repeat until vocab size reached. Corpus fragment: "low lower" Step 0 — character vocab: l o w l o w e r most frequent pair: "l o" (count 2) → merge Step 1 — after merging "l"+"o" → "lo": lo w lo w e r next: "lo"+"w" → "low" …
BPE greedily merges the most frequent character pair into a new token, so common subwords like "low" emerge naturally — rare words fall back to constituent pieces without ever needing an unknown token.
  1. Start with a vocabulary of every byte (or every character) in the corpus.
  2. Tokenize every word in the corpus as a sequence of those base symbols, with end-of-word markers.
  3. Find the most frequent adjacent pair of symbols across the entire corpus.
  4. Merge that pair into a new symbol and add it to the vocabulary.
  5. Re-tokenize the corpus to use the new symbol wherever the pair appeared.
  6. Repeat steps 3–5 until you’ve added the desired number of merges.

That’s the entire algorithm. It is greedy, frequency-based, and almost embarrassingly simple. The result is a vocabulary where common short patterns (th, ing, tion, the) get their own tokens because they appear in many words, while rare long patterns get split into pieces.

Walking through a tiny example

Suppose your corpus is just three words: low, lower, lowest. After step 2, your tokenized corpus looks like:

l o w </w>
l o w e r </w>
l o w e s t </w>

Counts of adjacent pairs:

  • l o: 3
  • o w: 3
  • w </w>: 1
  • w e: 2
  • e r: 1
  • r </w>: 1
  • e s: 1
  • s t: 1
  • t </w>: 1

Most frequent pair: l o (or o w, tie). Merge it. Vocabulary now contains a new symbol lo.

lo w </w>
lo w e r </w>
lo w e s t </w>

Most frequent pair: lo w. Merge to low.

low </w>
low e r </w>
low e s t </w>

Most frequent pair: low </w> (count 1, tied with everything). Or low e (count 2). Merge low e to lowe. And so on. After enough merges, the vocabulary will contain low, lower, lowest as single tokens, and rare new words like lows will fall back to low s.

This is the entire intuition. Real BPE training does this on billions of words, with hundreds of thousands of merges, but the algorithm is the same.

Inference (encoding new text)

Once you have the trained merge list, encoding a new word is a deterministic process:

  1. Start with the word as a sequence of base symbols.
  2. Find the highest-priority merge (the one trained earliest) that applies anywhere in the sequence.
  3. Apply it.
  4. Repeat until no more merges apply.

For words seen in the training corpus, this reproduces the trained tokenization. For unseen words, it greedily applies the merges that fit, falling back to base symbols where it has to.

Byte-level BPE

The version of BPE used by GPT-2, GPT-3, GPT-4, and most modern OpenAI models is byte-level BPE: instead of starting from characters or Unicode codepoints, it starts from raw bytes (256 base symbols). Every possible string is a sequence of bytes, so there is never an out-of-vocabulary problem — the worst case is that you fall back to one token per byte.

Byte-level BPE is what makes it possible for a model trained primarily on English to still produce something for Japanese text, emoji, or random binary data. It will be inefficient (more tokens per character), but it will work. We’ll come back to this in §5.7 when we talk about multilingual cost asymmetry.

5.3 The SentencePiece / WordPiece / Unigram / tiktoken zoo

You will see four tokenizer flavors in practice. They are all subword tokenizers; the differences are in the training algorithm and how they handle whitespace.

BPE (the original): bottom-up greedy merging by frequency. Used by GPT-2 (character-level), GPT-3/4 (byte-level via tiktoken), Llama, Mistral, most open-source models.

WordPiece: very similar to BPE but the merge criterion is likelihood (does the merge increase the corpus likelihood under a unigram language model?) rather than raw frequency. Used by BERT, DistilBERT, and the BERT family. The differences in practice are minor.

Unigram language model (Kudo, 2018): top-down. Start with a large vocabulary, score each token by how much removing it would hurt the corpus likelihood, prune the worst tokens, repeat. Produces a probabilistic tokenization where the same string has multiple possible segmentations and the model can sample among them. Used by T5, ALBERT, mT5, NLLB.

SentencePiece is not really a separate algorithm — it’s a library and a whitespace handling convention. SentencePiece treats whitespace as a normal character (it uses as a printable proxy) and doesn’t require pre-tokenization into words. This is critical for languages without spaces (Chinese, Japanese, Korean, Thai). SentencePiece can be configured to use BPE or Unigram under the hood. Llama uses SentencePiece-BPE; T5 uses SentencePiece-Unigram.

tiktoken is OpenAI’s open-source byte-level BPE implementation, used by GPT-3.5, GPT-4, and the OpenAI Embeddings API. It is fast and the source of truth for “how does the OpenAI API tokenize my text,” which matters for cost estimation.

You don’t need to know the implementation details of each. You need to know that they exist and which one your model uses, because the tokenizer is part of the model. Mixing tokenizers is a bug; we’ll get to that in §5.8.

5.4 Vocab size — the cost-quality tradeoff

Common vocab sizes:

ModelVocab size
BERT30,522
GPT-250,257
GPT-3, GPT-3.550,257 (r50k) / 100,277 (cl100k)
GPT-4100,277 (cl100k)
GPT-4o200,019 (o200k)
Llama 1, 232,000
Llama 3128,256
Qwen 2, 2.5151,936
Mistral32,000
DeepSeek-V3129,280

Vocab size is a real architectural lever. Bigger vocab means:

  • Shorter sequences. Each token covers more characters on average, so the same text becomes fewer tokens. This reduces sequence length, which reduces both training and inference compute (because attention is O(s²)).
  • Better multilingual coverage. A larger vocab can dedicate more tokens to non-English scripts. The jump from Llama 2’s 32k vocab to Llama 3’s 128k vocab was driven specifically by this — English performance is similar but non-English performance is much better.
  • Bigger embedding and unembedding matrices. Both the input embedding (vocab_size, d_model) and the output language model head (d_model, vocab_size) scale linearly with vocab size. For a 70B Llama 3 with d_model=8192, the embedding+head are about 2.1B parameters — a few percent of total — versus about 0.5B for the smaller Llama 2 vocab. Not huge, but not zero.
  • Worse generalization on rare tokens. Each individual token in a 128k vocab is seen less often during training than each token in a 32k vocab on the same corpus. The long tail is harder to learn.

The “right” vocab size is a function of the corpus, the languages you care about, and the model size. The current consensus for general-purpose LLMs is in the 100k–200k range. Don’t expect another order-of-magnitude jump — at some point the marginal token isn’t worth the embedding-table cost.

5.5 Special tokens

Every tokenizer reserves some token IDs for special tokens that aren’t part of the natural language vocabulary. The standard set:

  • <bos> (or <s>) — beginning of sequence. Some models always prepend it; others don’t.
  • <eos> (or </s>) — end of sequence. The model is trained to emit this when a generation should stop.
  • <pad> — padding token, used to fill the right side of short sequences in a batch up to the longest sequence’s length. Excluded from loss computation via the attention mask.
  • <unk> — unknown token. In a byte-level BPE model, this is technically never needed because every byte is in vocab; in word-piece or character-level models, it’s the fallback.
  • <mask> — for BERT-style masked language modeling.
  • <cls> and <sep> — BERT’s classification and segment-separator tokens.

For instruction-tuned LLMs, there is also a family of chat template tokens that mark the role of each turn:

  • <|system|>, <|user|>, <|assistant|> — Llama-style.
  • <|im_start|>, <|im_end|> — ChatML / Qwen / GPT-style.
  • <|begin_of_text|>, <|end_of_text|>, <|start_header_id|>, <|end_header_id|>, <|eot_id|> — Llama 3’s elaborated set.

These are the most important special tokens in modern LLM serving and the most common source of bugs. We’ll dedicate the next section to them.

5.6 Chat templates and the silent prompting disaster

Here is how 80% of “the model is being weird” bugs happen in production.

Chat template token flow: special role tokens wrap system, user, and assistant turns; missing or wrong tokens cause silent model degradation. Correct Llama 3 chat template (every byte matters) <|begin…|> <|start_header…|> system <|end_hdr|> You are… <|eot_id|> <|start_header…|> user <|end_hdr|> What is 2+2? <|eot_id|> <|start_header…|> assistant <|end_hdr|> ← model generates ← no eot here! Wrong (plain text — model still runs, output silently degrades) system: … user: … assistant: missing special tokens → model confused, repetitive, ignores system prompt
Every special role token (highlighted) is part of the model's learned structure — omitting even one causes silent capability degradation that looks like a "dumb model" bug but is actually a prompt-formatting bug.

When you fine-tune an instruction model (Chapter 16), you train it to respond to messages formatted in a specific way — the chat template. For example, Llama 3 expects this format for a conversation:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is 2+2?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Note the exact whitespace, the exact tokens, the trailing newlines, the absence of an <|eot_id|> after the assistant header (because we’re inviting the model to generate the response). Every byte matters. The model was trained on this exact format and learned to recognize the structural tokens as the boundary between “what the user said” and “what I should produce.”

If you serve the model with the wrong template — say, you forget the <|begin_of_text|> BOS, or you put a space where there shouldn’t be one, or you concatenate “system: … user: … assistant: …” in plain text without the special tokens — the model will still produce something. It will not crash. It will just be subtly worse: more repetitive, more prone to ignoring the system prompt, more likely to break out of character. You will spend two days wondering why the model is dumber in production than in your local tests, and the answer will be “you forgot the chat template.”

The fix is to always use the tokenizer’s apply_chat_template() method rather than concatenating strings yourself:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,   # appends the assistant header
)

The template lives inside the tokenizer config (tokenizer_config.json has a Jinja template string in the chat_template field). Different model variants have different templates. When in doubt, look at the template the tokenizer ships with, not at the prose in the model card.

Two specific gotchas:

  • Some models prepend BOS automatically when tokenizing, others don’t. If you call tokenizer(prompt) after apply_chat_template(prompt, tokenize=False), you may get a double BOS. The fix is tokenizer(prompt, add_special_tokens=False).
  • The “generation prompt” — the part of the template that invites the model to start producing — varies by model. Llama 3 ends with <|start_header_id|>assistant<|end_header_id|>\n\n, ChatML ends with <|im_start|>assistant\n. Forgetting the trailing newline can completely change the model’s behavior.

This single section is worth more in production debugging than all the architecture chapters combined. It is also the single most-asked area in LLM-systems interviews after KV cache.

5.7 Multilingual cost asymmetry

A token in a byte-level BPE model represents (on average) different amounts of text in different languages. English is the language the tokenizer was overwhelmingly trained on, so English tokens cover lots of characters. Non-English text falls back to shorter tokens — sometimes one token per character or even one token per byte — which means the same idea takes 2–10× more tokens in some languages than in English.

The famous numbers, for OpenAI’s cl100k_base tokenizer:

LanguageApproximate tokens per “character of meaning”
English1.3
Spanish1.5
French1.5
German1.7
Russian2.0
Chinese2.5
Japanese2.7
Hindi5.0
Burmese8.0

The implications are real and unfair:

Same semantic content encoded in different languages shows 1x to 8x token-count ratio, with English cheapest and Burmese most expensive. Language tokens / meaning unit (relative) EN ES FR DE RU ZH JA MY ~6×
Burmese users pay roughly 6× as much per API call as English users for the same semantic content — a fairness and cost problem that larger vocabularies (Llama 3's 128k) partially address.
  • You pay more. OpenAI bills per token. A Hindi-speaking user gets billed several times as much as an English speaker for the same prompt.
  • Your context window is smaller. A 128k context window holds 30k tokens of Burmese vs 90k tokens of English.
  • Your latency is worse. More tokens to prefill = more compute = slower first-token time. More tokens to decode = more steps = slower per-completion.
  • Your model is worse. Fewer training samples per token, more brittle behavior on the long tail.

This is why Llama 3’s vocab jumped from 32k to 128k — a deliberate move to spread vocab across more languages. It didn’t help English at all; it helped everyone else a lot. Tokenizer choices are a fairness issue and an engineering issue at the same time.

5.8 Tokenizer drift across versions

Tokenizers get updated. New models often ship with new tokenizers that are almost the same but not quite. The token IDs for the same text can shift between versions, which means:

  • Embeddings trained on one tokenizer don’t work with another. If you embed your knowledge base with text-embedding-ada-002 and then switch to text-embedding-3-small, you have to re-embed everything. The new model has a different tokenizer and a different embedding space.
  • Cached KV caches are invalidated. Prefix caching (Chapter 29) relies on exact token-sequence matches. Change the tokenizer and your cache hit rate goes to zero.
  • Token-level cost estimates break. If your billing code assumes a fixed tokenizer and you upgrade the model, your cost estimates start lying.

The discipline: the tokenizer is part of the model artifact. Pin them together. Never assume “the tokenizer is the same across versions” without checking. The HuggingFace AutoTokenizer.from_pretrained(model_id) pattern is the right one — it loads the exact tokenizer that ships with that exact model.

A famous variant of this bug: the chat template is part of the tokenizer config. Two versions of a model with the same architecture and the same vocab can ship with different templates. The Llama 3 → Llama 3.1 → Llama 3.2 → Llama 3.3 chain is full of small template changes that bite people.

5.9 Production gotchas

A handful of things that will save you debugging time:

The “letter counting” disaster. A model “can’t count the number of rs in ‘strawberry’” because to the model, “strawberry” is two or three opaque token IDs, not a sequence of letters. The model has no character-level view of its own input. This is a tokenizer fact, not an intelligence fact, and explaining it correctly in interviews is a strong signal.

The leading-space gotcha. In most BPE tokenizers, words at the start of a sequence and words after a space are tokenized differently. The token for " hello" (with leading space) is different from the token for "hello". This means tokenizer.encode("hello world") and tokenizer.encode("hello") + tokenizer.encode("world") produce different results. Always tokenize the full string at once.

The trailing-newline gotcha. Some models are very sensitive to whether the prompt ends with a newline, two newlines, or no newline. When in doubt, match the chat template exactly.

Mixing tokenizers between training and inference. If you fine-tune with one tokenizer config and serve with a slightly different one (different special tokens, different added tokens), the model behavior degrades silently. Always use the same config.

Token ID ranges as an attack surface. Some special tokens are not in the “normal” vocab range; if you allow user input to contain raw text that decodes to a control token (<|im_end|>, <|endoftext|>), the user can break out of the chat template and impersonate other roles. This is the LLM equivalent of SQL injection. The fix is to either (a) escape user input before tokenization, or (b) use a tokenizer that refuses to encode special tokens from raw text (the add_special_tokens=False and split_special_tokens=True options in HuggingFace tokenizers).

5.10 The mental model

Eight points to take into Chapter 6:

  1. Subwords won. Every modern LLM uses some flavor of BPE, WordPiece, or Unigram.
  2. Byte-level BPE has no out-of-vocabulary. GPT-2/3/4 and most OpenAI models use it.
  3. The tokenizer is part of the model artifact. Mixing them is a bug.
  4. Vocab size is a real architectural lever. Bigger vocab = shorter sequences, better multilingual, bigger embedding tables.
  5. Special tokens and chat templates are 80% of “the model is being weird” bugs. Always use apply_chat_template.
  6. Tokenization is unfair across languages. English costs less. Track this consciously.
  7. The “count the r’s in strawberry” failure is a tokenizer story, not an intelligence story.
  8. Tokenizer drift across model versions invalidates caches, embeddings, and cost models. Pin versions.

In Chapter 6 we use these tokens to build the most important operation in the transformer: attention.


Read it yourself

  • Sennrich, Haddow & Birch, Neural Machine Translation of Rare Words with Subword Units (2016) — the BPE paper. Six pages. Read it.
  • Kudo, Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates (2018) — the Unigram tokenizer paper.
  • HuggingFace’s Tokenizers library docs — the practical reference for everything in this chapter.
  • The OpenAI tiktoken repo on GitHub — read the README for the encoding table.
  • Andrej Karpathy’s YouTube video Let’s build the GPT Tokenizer — two hours of building BPE from scratch in Python. The most efficient way to internalize §5.2.

Practice

  1. Use tiktoken to tokenize the same sentence in English, Russian, and Hindi. Compare the token counts. Predict the cost ratio for an OpenAI API call.
  2. In tiktoken.get_encoding("cl100k_base"), find the token IDs for "hello", " hello", "Hello", and " Hello". Are any of them the same? Why or why not?
  3. Use transformers.AutoTokenizer to load meta-llama/Llama-3.1-8B-Instruct’s tokenizer. Print tokenizer.chat_template. Read the Jinja template carefully and write out, by hand, the exact string that apply_chat_template produces for a system+user+assistant turn.
  4. Train a tiny BPE tokenizer on a small text corpus (the first chapter of a book is enough) using the HuggingFace tokenizers library. Set vocab size to 1000. Inspect the merges. Which patterns get learned first?
  5. Tokenize the string "strawberry" with the GPT-4 tokenizer. How many tokens? How many characters? What does this tell you about why the model can’t count letters?
  6. Why does mixing two tokenizers — say, encoding with GPT-4’s tokenizer and decoding with GPT-2’s — not produce a clean error? What does it produce instead, and why is that worse than an error?
  7. Stretch: Implement byte-level BPE training from scratch in Python in under 200 lines. Train it on a small corpus and verify it produces sensible merges. The Karpathy video is the best resource if you want a walkthrough.

Concept check

4 questions. Click a choice to check. Your score is saved locally.

Score
0 / 4
  1. 1. Byte-Pair Encoding (BPE) constructs its vocabulary by repeatedly
  2. 2. A user asks why GPT-4 cannot correctly count the letters in the word 'strawberry.' The real explanation is
  3. 3. A model trained with tokenizer version A is deployed with tokenizer version B, where a common token was split differently. What is the most likely production symptom?
  4. 4. Why do languages with complex morphology (e.g., Turkish, Finnish) cost more to process with BPE tokenizers trained primarily on English?
Related chapters