Part II · Training, Fine-Tuning, Alignment
Chapter 14 · Deep dive · ~20 min read

Tokenizer training: BPE and SentencePiece from scratch

"Choose your tokenizer once. Then live with the consequences for the lifetime of the model."

In Chapter 5 we covered what a tokenizer is and how it bites. In this chapter we go up the stack: how tokenizers are trained. By the end you’ll understand the BPE training algorithm down to the last counter, the role of the corpus, what vocab size actually controls, the multilingual problem, and why retraining a tokenizer is one of the hardest things in NLP.

This chapter is shorter than the others in Part II because the algorithm is simple — what makes tokenizer training hard is the corpus selection and the downstream consequences, both of which we already touched on in Chapter 5.

Outline:

  1. The tokenizer-training problem.
  2. BPE training, in detail.
  3. The corpus selection question.
  4. Vocab size — what to optimize against.
  5. Whitespace handling and SentencePiece.
  6. Special tokens and reserved IDs.
  7. Multilingual training.
  8. The HuggingFace tokenizers library in 30 lines.
  9. Why you can never retrain the tokenizer.

14.1 The problem

Given a corpus of text, produce a tokenizer with vocab size V that:

  • Has every character (or every byte) representable somehow, so there’s no out-of-vocabulary problem.
  • Compresses common patterns into single tokens to reduce sequence length.
  • Spreads its vocab budget across the languages and domains in the corpus.
  • Produces consistent, deterministic tokenizations of the same text every time.

These goals are in tension. Compressing more aggressively reduces sequence length but means each token is rarer and harder to learn. Spreading vocab across more languages reduces the per-language compression. Including domain-specific tokens (math, code, scientific notation) helps those domains and hurts others by stealing vocab slots.

The choice of training algorithm answers most of these questions for you: BPE produces a vocab that greedily prioritizes the most frequent patterns. The result is unsupervised, fast to train, and reasonably aligned with linguistic units (morphemes, common stems) without ever being told what a morpheme is. Other algorithms (Unigram, WordPiece) produce slightly different vocabs by optimizing slightly different objectives, but the BPE intuition is the foundation.

14.2 BPE training, in detail

BPE training is a frequency-based agglomerative procedure. The complete algorithm in eight bullet points:

(1) Read the training corpus into memory. For each word (or each contiguous run of characters in a language without word boundaries), record the word and its frequency. The result is a dict[word, count] with a few million entries for a serious corpus.

(2) Pre-tokenize each word into a sequence of base symbols. For byte-level BPE, the base symbols are the 256 possible byte values. For character-level BPE, they’re the unique characters in the corpus. For SentencePiece, they’re slightly different (see §14.5). Append an end-of-word marker (</w>, or in some schemes a word-boundary marker such as ▁) to each word so the tokenizer can recover word boundaries from the token sequence.

(3) Initialize the vocabulary to the base symbols. So you start with 256 entries (for byte-level) or a few thousand (for character-level Unicode).

(4) Initialize the merge list to the empty list. The merge list is the ordered sequence of “merge this pair into this new symbol” operations that defines the tokenizer.

(5) Build a pair-frequency table: for each adjacent pair of symbols across all the words in the corpus (weighted by word frequency), count how often that pair occurs.

(6) Find the most frequent pair in the table. Add it to the merge list. Add a new symbol to the vocabulary representing the merged pair.

(7) Update the corpus: in every word, replace every occurrence of the merged pair with the new symbol. Update the pair-frequency table (this is the part that needs efficient incremental data structures — we’ll come back to it).

(8) Repeat steps 6–7 until the vocabulary reaches the target size.

That’s the entire algorithm. The output is a vocabulary (the set of all symbols that exist) and a merge list (the ordered sequence of merges that produced the vocabulary). Both ship with the tokenizer.
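The eight steps above fit in a short pure-Python sketch. This is the naive version that recounts pairs from scratch each round (the incremental data structures come next); the toy corpus and the lexicographic tie-break direction are illustrative choices, not a canonical implementation.

```python
from collections import Counter

def train_bpe(word_counts, num_merges):
    """Naive BPE trainer. word_counts maps each word (a string) to its
    corpus frequency; returns the ordered merge list."""
    # Steps 1-4: split words into base symbols, append the end-of-word marker.
    corpus = {tuple(w) + ("</w>",): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Step 5: count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for syms, freq in corpus.items():
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Step 6: most frequent pair wins; ties broken lexicographically
        # (direction is a convention; here the larger pair wins).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = best[0] + best[1]
        # Step 7: replace every occurrence of the pair with the new symbol.
        new_corpus = {}
        for syms, freq in corpus.items():
            out, i = [], 0
            while i < len(syms):
                if i + 1 < len(syms) and (syms[i], syms[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            key = tuple(out)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges

# Step 8: repeat until the target number of merges is reached.
merges = train_bpe({"low": 5, "lower": 2, "lowest": 3, "newest": 4}, 6)
```

On this toy corpus the first merges are "ow" and then "l"+"ow", because "low" appears (as a prefix) in ten of the fourteen words.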

[Figure: BPE merge tree for the word "lowest". Starting from the base bytes l o w e s t, frequent pairs are merged bottom-up (first "es", then "est", then the merges that build "low"), until the word tokenizes as low est. The merge list records each merge in order; merges learned first are the most frequent patterns in the corpus: Zipf's law at work.]
BPE greedily merges the most frequent adjacent pair at each step; the merge list records the order, and applying it left-to-right at inference recovers the same tokenization deterministically.

The data structures

A naive implementation rebuilds the pair-frequency table from scratch after every merge. For a corpus of 10M words and a target vocab of 100k, that’s 10⁵ rebuilds × 10⁷ words = 10¹² operations, which is too slow.

The trick is incremental updates. The pair-frequency table is maintained as a hash map. When you merge pair (a, b) into symbol ab, you:

  1. For every word containing a b (which you can find efficiently by indexing the words by which pairs they contain), update the surrounding pair counts:
    • The pair (prev, a) decreases by the word’s frequency.
    • The pair (prev, ab) increases by the word’s frequency.
    • Similarly on the right side: (b, next) decreases, (ab, next) increases.
  2. Update the word’s symbol sequence.

With this kind of incremental bookkeeping, BPE training scales to multi-billion-token corpora with vocab sizes in the hundreds of thousands. The HuggingFace tokenizers library does it in highly-optimized Rust and trains a 100k-vocab tokenizer on 30 GB of text in under an hour on a single machine.
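The delta update for a single word can be sketched as follows. This is illustrative bookkeeping only: the function name is mine, and the corner case of back-to-back occurrences of the same pair needs extra care that real implementations (like the Rust trainer) handle.

```python
from collections import Counter

def merge_in_word(syms, pair, counts, freq):
    """Apply one merge inside one word and incrementally patch the global
    pair-frequency table, instead of rebuilding it from scratch."""
    a, b = pair
    merged = a + b
    out, i = [], 0
    while i < len(syms):
        if i + 1 < len(syms) and syms[i] == a and syms[i + 1] == b:
            prev = out[-1] if out else None
            nxt = syms[i + 2] if i + 2 < len(syms) else None
            counts[(a, b)] -= freq            # the merged pair disappears
            if prev is not None:
                counts[(prev, a)] -= freq     # (prev, a) becomes (prev, ab)
                counts[(prev, merged)] += freq
            if nxt is not None:
                counts[(b, nxt)] -= freq      # (b, next) becomes (ab, next)
                counts[(merged, nxt)] += freq
            out.append(merged)
            i += 2
        else:
            out.append(syms[i])
            i += 1
    return out

word = ["l", "o", "w", "e", "s", "t"]
counts = Counter(zip(word, word[1:]))  # pair counts for this word, freq 1
new_word = merge_in_word(word, ("e", "s"), counts, freq=1)
```

Only the handful of pairs adjacent to the merge site are touched, which is what turns the 10¹²-operation naive loop into something tractable.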

Determinism and ties

When two pairs have the same frequency, you have to break the tie. The standard choice: lexicographic ordering on the pair, plus a stable iteration order. This is what makes BPE reproducible — the same corpus and same target vocab always produce the same tokenizer.

What gets learned

Walk through what BPE will learn on an English corpus, starting from byte-level base symbols:

  • The first ~256 tokens are the base bytes. Already there.
  • The first few hundred merges are very common bigrams: th, he, in, er, an, re, on, at, ed, nd, …
  • The next few thousand merges are common short subwords: the, ing, tion, ment, ly, ed, er, able, …
  • Then common short words: the, of, and, to, a, …
  • Then longer common words: because, people, would, …
  • Then the long tail: domain-specific terms, names, rare words.

The tokenizer is essentially learning Zipf’s law: the most common patterns get their own tokens, in roughly decreasing order of frequency. This is why a BPE tokenizer is unreasonably effective without ever being told what a “word” or a “morpheme” is.

14.3 The corpus selection question

The tokenizer is trained on a sample of the pretraining corpus. Not the full thing — that would be unnecessarily slow — but a representative sample of tens of GB.

The choice of sample matters. The tokenizer’s vocab is shaped by what it sees, and if you train the tokenizer on only English Wikipedia and then pretrain on multilingual web data, the tokenizer is going to be terrible at non-English text (and will produce many more tokens per character for those languages).

The right approach: train the tokenizer on a sample that mirrors the language and domain mix of the pretraining corpus. If your pretraining is going to be 10% code, your tokenizer training sample should be 10% code. If your pretraining is going to be 30% non-English, the sample should be 30% non-English.

This is one of the things that distinguishes a careful tokenizer (Llama 3, Qwen 2.5, DeepSeek) from a sloppy one. Sloppy tokenizers are trained on a convenient sample (often “the first slice of the corpus we had on disk”), and the resulting models are unfair across languages and domains.

A useful diagnostic: after training the tokenizer, measure the average tokens per character on held-out text in each language and domain. A well-trained tokenizer for English should be in the 0.25–0.3 tokens/char range; for Chinese, 0.6–1.0; for Hindi, 1.0–2.0. Compare these numbers across the languages you care about; large asymmetries are a red flag.
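The diagnostic above is a one-liner to implement. A minimal sketch; `str.split` is a stand-in encoder for demonstration, and in practice you would pass something like `lambda t: tokenizer.encode(t).ids` for each tokenizer under test.

```python
def tokens_per_char(encode, texts):
    """Average tokens emitted per character of held-out text.
    `encode` is any callable mapping a string to a list of tokens/IDs."""
    total_tokens = sum(len(encode(t)) for t in texts)
    total_chars = sum(len(t) for t in texts)
    return total_tokens / total_chars

# Toy stand-in: whitespace splitting (6 tokens over 22 characters).
ratio = tokens_per_char(str.split, ["the cat sat on the mat"])
```

Run it per language and per domain on held-out text, then compare the ratios side by side.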

14.4 Vocab size — the real tradeoff

We talked about vocab size in Chapter 5. The training-time perspective:

  • Bigger vocab = longer tokenizer training. The number of merges you do is roughly the vocab size, so a 200k vocab takes roughly 2× the time of a 100k vocab to train. This is small in absolute terms (hours, not days).
  • Bigger vocab = bigger embedding matrices in the model. The input embedding is (V, D) and the LM head is (D, V). For a 70B Llama 3 with V=128k and D=8192, that’s about 2.1B parameters in embeddings — a few percent of the model. Doubling V doubles this contribution.
  • Bigger vocab = shorter sequences. Each token covers more characters on average. A 200k vocab might compress text to 80% the length of a 100k vocab on the same corpus.
  • Bigger vocab = harder to learn the long tail. Each rare token is seen less frequently in training, so the embedding for it is less well-fit. There’s no free lunch.

The current consensus for general-purpose LLMs is in the 100k–200k range. The argument for going higher: better multilingual coverage. The argument for going lower: simpler model, smaller embeddings, faster training of the embedding layer. Most modern open LLMs land in the 128k–200k range.
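The embedding arithmetic from the list above, as a tiny helper (my naming; the Llama 3 figures are the ones quoted in the text):

```python
def embedding_params(vocab_size, d_model, tied=False):
    """Parameters spent on the (V, D) input embedding plus the (D, V)
    LM head; tied weights share a single matrix."""
    per_matrix = vocab_size * d_model
    return per_matrix if tied else 2 * per_matrix

# V = 128k, D = 8192 (the 70B Llama 3 numbers from the text): about 2.1B.
n = embedding_params(128_000, 8192)
```

Doubling V doubles this contribution, exactly as the bullet states, while every other parameter in the model stays fixed.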

[Figure: vocab size tradeoffs across 32k, 64k, 100k, 150k, 200k+. As vocabulary size grows, sequence length falls and embedding parameters rise; the sweet spot for general-purpose models is roughly 128k–200k.]
Larger vocab compresses sequences (good for compute) but grows the embedding table and leaves rare tokens under-trained — the 128k–200k range balances these forces for general-purpose LLMs.

The exception is models targeting a specific narrow domain — a code-only model might use a smaller vocab (50k) optimized for code, with no vocab spent on natural language at all. A medical model might use a vocab heavy in medical terminology. The choice is always a function of the corpus and the target use case.

14.5 Whitespace handling and SentencePiece

A subtle but consequential part of tokenization: how do you handle whitespace?

The naive approach (used by the original BPE paper for translation) is to pre-tokenize the input by splitting on whitespace, train BPE on each “word” independently, and then re-tokenize at inference time the same way. This works for languages with explicit word boundaries (English, French, Russian) but breaks for languages without spaces (Chinese, Japanese, Korean, Thai). Pre-tokenizing Chinese on whitespace gives you whole sentences as single “words,” which is useless.

The fix is SentencePiece (Kudo & Richardson, 2018), which treats whitespace as a normal character and doesn’t pre-tokenize at all. Spaces are converted to a special character (▁, often called “thick underscore”), and BPE runs on the entire byte stream including spaces. The result:

  • A token for ▁the exists alongside a token for the (the second is for cases where it appears mid-word).
  • The exact byte sequence is recoverable from the tokens by replacing ▁ with a space.
  • Languages without spaces work natively — the algorithm doesn’t need to know what a word is.

SentencePiece is the right choice for any modern multilingual model. It’s used by Llama (with BPE under the hood), T5 (with Unigram under the hood), and many others.

The HuggingFace tokenizers library has a ByteLevel pre-tokenizer that achieves a similar effect for byte-level BPE: spaces become a specific byte, and the algorithm runs over bytes uniformly. Llama 3, Qwen, and most modern OpenAI models use this approach.

The tradeoff: with SentencePiece-style whitespace handling, the same word is tokenized differently depending on whether it has a leading space. ▁hello and hello are different token IDs. This is the “leading-space gotcha” from Chapter 5, and it’s a direct consequence of the whitespace-as-character design choice.
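The space-to-▁ substitution itself is trivial; a minimal sketch of the round trip (real SentencePiece also applies Unicode normalization and, by default, prepends a dummy ▁ to the input, which this sketch omits):

```python
MARK = "\u2581"  # the SentencePiece whitespace marker, "thick underscore"

def to_pieces_view(text):
    """Make whitespace a first-class symbol, as SentencePiece does
    before training or encoding."""
    return text.replace(" ", MARK)

def from_pieces_view(s):
    """Recover the exact original byte sequence."""
    return s.replace(MARK, " ")

# The leading-space gotcha in miniature: "hello" after a space carries the
# marker, "hello" at the start of the string does not, so the two produce
# different symbol sequences (and hence different tokens).
with_space = to_pieces_view(" hello")
without = to_pieces_view("hello")
```

Because the substitution is lossless, detokenization is exact, which is precisely the property pre-tokenized BPE gives up.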

14.6 Special tokens and reserved IDs

Before training, you reserve some token IDs for special tokens that aren’t learned by BPE — they’re added by hand. The standard set:

  • <bos> / <s> — beginning of sequence
  • <eos> / </s> — end of sequence
  • <pad> — padding
  • <unk> — unknown (rarely needed in byte-level BPE)
  • <mask> — for BERT-style MLM training

For chat models, you also reserve:

  • <|system|>, <|user|>, <|assistant|>
  • <|im_start|>, <|im_end|>
  • Or whatever role tokens your chat template uses

These tokens are added to the vocabulary as-is, without training. The corresponding embedding entries are randomly initialized at model startup and learned during pretraining (for the structural tokens) or during instruction tuning (for the chat-template tokens, since the base model doesn’t know the chat template yet).

A typical LLM tokenizer reserves the first ~256 token IDs for special tokens and base bytes, and the remaining 100k+ for learned BPE merges. The reserved range is large because some models reserve “spare” slots for future special tokens — DeepSeek-V3, for example, reserves a few hundred unused IDs that it can fill in later versions without breaking the embedding table layout.

The discipline: when you add new special tokens after training (e.g., a new role token for a fine-tuned chat model), you have to resize the embedding table to accommodate them, and the new entries are randomly initialized. This is fine for fine-tuning but wastes some training effort. Some teams reserve “future special tokens” up front for this reason.

14.7 Multilingual training

Training a tokenizer that works well in many languages is one of the harder problems in this chapter. The straightforward approach — sample text from each language proportionally to its representation in the corpus — produces a tokenizer that’s biased toward the largest languages.

The fix is to upweight rare languages during tokenizer training. Instead of sampling each language proportionally to its frequency, you sample each language proportionally to frequency^α for some α < 1. With α = 0, every language is sampled equally; with α = 1, every language is sampled in proportion to its corpus share. Common values are α = 0.3 to 0.5, which gives smaller languages much more representation in the tokenizer than they would have in pure proportional sampling.
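The reweighting formula is small enough to show directly. A sketch with the 80/15/5 English/Chinese/Hindi mix used in this section (function name is mine):

```python
def temperature_weights(proportions, alpha):
    """Language sampling weights proportional to p**alpha, renormalized
    to sum to 1. alpha = 1 is proportional; alpha = 0 is uniform."""
    scaled = [p ** alpha for p in proportions]
    total = sum(scaled)
    return [s / total for s in scaled]

# 80% English, 15% Chinese, 5% Hindi at alpha = 0.3: English is pulled
# well below 80%, Hindi well above 5%.
w = temperature_weights([0.80, 0.15, 0.05], alpha=0.3)
```

Lowering α compresses the distribution toward uniformity while preserving the ordering of the languages, which is exactly the behavior you want from a fairness knob.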

[Figure: temperature-based language sampling. Proportional sampling (α = 1) on a corpus of 80% English, 15% Chinese, 5% Hindi starves rare languages of vocab budget; temperature scaling (α = 0.3) compresses the distribution toward uniformity, so rare languages gain a fair share. Sampling weight ∝ frequency^α.]
Temperature-based language sampling (α ≈ 0.3) redistributes vocab budget from high-frequency languages toward rare ones — smaller languages gain tokens that would otherwise go to more common bigrams in English.

This doesn’t solve the problem (English will still get more vocab budget than Burmese, because English has more distinct patterns), but it shifts the unfairness substantially.

The Llama 3 tokenizer training reportedly used a temperature-based sampling scheme with α ≈ 0.3, which is one of the reasons Llama 3 is much more multilingual than Llama 2. Qwen’s tokenizer uses an even more aggressive multilingual mix.

14.8 The HuggingFace tokenizers library in 30 lines

For practical work, you don’t write BPE training from scratch — you use the HuggingFace tokenizers library, which is implemented in Rust and heavily optimized.

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.decoders import ByteLevel as ByteLevelDecoder

tokenizer = Tokenizer(BPE(unk_token=None))
tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False)
tokenizer.decoder = ByteLevelDecoder()

trainer = BpeTrainer(
    vocab_size=128_000,
    special_tokens=[
        "<|begin_of_text|>",
        "<|end_of_text|>",
        "<|im_start|>",
        "<|im_end|>",
    ],
    initial_alphabet=ByteLevel.alphabet(),
    show_progress=True,
)

tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("my_tokenizer.json")

# Use it
encoding = tokenizer.encode("Hello, world!")
print(encoding.ids)
print(encoding.tokens)

That’s a complete byte-level BPE training pipeline. The corpus.txt file should contain the training text you want the tokenizer to specialize on. For multilingual training, concatenate text from multiple languages with the appropriate temperature reweighting.

Once you have the trained tokenizer, you can wrap it in a HuggingFace PreTrainedTokenizerFast for use with the rest of the transformers library.

14.9 Why you can never retrain the tokenizer

This is the punchline of the chapter. Once you’ve pretrained a model, you cannot change its tokenizer. The reasons:

  • The token IDs are baked into the embedding table. Each row of the embedding matrix corresponds to a specific token ID. If you change the tokenizer, the IDs shift, and every row of the embedding now points to a different token. The model has to relearn its embeddings from scratch.
  • The model has learned token-specific patterns. It knows that token 9384 tends to follow certain other tokens in certain contexts. Reassigning 9384 to a different word breaks all of that.
  • The downstream pipeline is built around the tokenizer. Chat templates, prompt engineering, RAG pipelines, eval suites — they all encode text into the model’s tokenizer. Changing the tokenizer means re-running every cached embedding, every prefix cache, every fine-tuning dataset.

The practical implication: the choice of tokenizer is committed at the start of pretraining and cannot be revisited. A bad tokenizer is a permanent tax on the model. This is why frontier labs spend so much effort on the tokenizer choice — they only get one shot, and the consequences are permanent.

The exception is continued pretraining with vocab expansion. You can add new tokens to a tokenizer (the existing IDs remain stable, the new tokens get fresh IDs at the end), and then continue training to learn the new embeddings. This is sometimes done for domain adaptation: take a general LLM, add ~10k domain-specific tokens (medical terms, legal terms, code symbols), continue training. The model learns the new tokens without losing what it had. But you cannot remove tokens, rearrange them, or change the merge list — all of that breaks compatibility.
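The invariant that makes vocab expansion safe is simply "append, never rearrange". A toy sketch with a plain list-of-lists embedding table (in a real model you would call something like `model.resize_token_embeddings`; the init scale here is a common convention, not a requirement):

```python
import random

def expand_embeddings(emb, new_tokens, d_model, std=0.02):
    """Vocab expansion: existing rows keep their token IDs untouched;
    new tokens get fresh IDs at the end, randomly initialized."""
    for _ in range(new_tokens):
        emb.append([random.gauss(0.0, std) for _ in range(d_model)])
    return emb

emb = [[float(i)] * 4 for i in range(10)]  # toy (10, 4) embedding table
old_row_0 = list(emb[0])
expand_embeddings(emb, new_tokens=3, d_model=4)
```

Every old ID still points at the same row, so the pretrained model's token-specific knowledge survives; only the new rows need continued training.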

This permanence is the deeper reason chat templates matter so much (Chapter 5). Once you’ve trained a model with a specific chat template using specific reserved tokens, you can never change those tokens. The template is part of the tokenizer is part of the model.

14.10 The mental model

Eight points to take into Chapter 15:

  1. BPE training is greedy frequency-based merging. Start from base bytes, repeatedly merge the most frequent pair, until you hit the target vocab size.
  2. The corpus the tokenizer is trained on shapes everything downstream. Match the tokenizer training mix to the pretraining mix.
  3. Vocab size is a tradeoff between sequence length, embedding parameters, and long-tail token quality. 100k–200k is the modern range.
  4. SentencePiece treats whitespace as a character, which makes it work for languages without spaces and produces the leading-space gotcha as a side effect.
  5. Special tokens are added by hand, not learned. Reserve future slots if you can.
  6. Multilingual training requires temperature-based language sampling to give rare languages a fair share of the vocab.
  7. The HuggingFace tokenizers library in 30 lines is enough to train a real BPE tokenizer.
  8. You cannot retrain the tokenizer once a model is pretrained. The choice is permanent.

In Chapter 15 we shift from pretraining to fine-tuning: how to take a pretrained model and adapt it to your task without spending another $50M.


Read it yourself

  • Sennrich, Haddow & Birch, Neural Machine Translation of Rare Words with Subword Units (2016) — the BPE paper, again. Section 3 is the algorithm in detail.
  • Kudo & Richardson, SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (2018) — the SentencePiece paper.
  • The HuggingFace tokenizers library documentation and source. Actually read the BPE trainer source in tokenizers/src/models/bpe/trainer.rs. It’s 800 lines of optimized Rust and you can follow it.
  • Andrej Karpathy’s Let’s build the GPT Tokenizer YouTube video — two hours of building BPE from scratch in Python with full explanations.

Practice

  1. Implement BPE training from scratch in pure Python. Train it on a corpus of ~1MB of text with target vocab 5000. Inspect the merge list. Which patterns get learned first?
  2. The HuggingFace tokenizers library can train a 100k-vocab tokenizer on 30GB of text in under an hour. Estimate the throughput in (words processed per second, merges per second). Why is it so fast?
  3. Train a tokenizer on (a) only English Wikipedia, (b) only Chinese Wikipedia, (c) a 50/50 mix. For each, measure tokens-per-character on held-out English and Chinese text. Compare.
  4. Why does temperature-based language sampling (α < 1) give more vocab to rare languages? Show the math for α = 0.3 on a 3-language corpus with proportions 90/9/1.
  5. Why can you add new special tokens to a trained tokenizer but not change the BPE merge list? Trace what happens to the embedding matrix in each case.
  6. Take a pretrained model (e.g., meta-llama/Llama-3.1-8B) and resize its embedding table to add 10 new special tokens. Verify the existing token IDs still produce the same embeddings.
  7. Stretch: Train a SentencePiece tokenizer on a multilingual corpus of your choice using sentencepiece Python library, with vocab size 50k and α = 0.3. Compare the resulting per-language tokenization rates against a tokenizer trained without language reweighting.

Concept check

4 questions.
  1. What does BPE greedily optimize during each merge step?
  2. Why does a larger vocab size generally reduce sequence length but increase training difficulty?
  3. What is the key reason you cannot retrain a tokenizer after a model is already pretrained?
  4. SentencePiece differs from standard BPE primarily because it treats whitespace as a regular character and does pre-tokenization differently. What concrete problem does this solve for multilingual models?