Part III · Inference Internals & Production Serving
Chapter 32 · Intermediate · ~23 min read

Multimodal: vision-language, audio, the tokenizer trick

"The trick that makes multimodal work is the same trick that made text work: turn it into a sequence of tokens, and let the transformer figure out the rest."

The transformer architecture from Chapter 7 doesn’t care what the tokens represent. It takes a sequence of vectors and produces another sequence of vectors. The vectors can come from text tokens, image patches, audio frames, video clips, robot sensor readings, or anything else. As long as you can produce a sequence of vectors and define a loss, the same architecture works.

This chapter is about how that idea is applied in practice. By the end you’ll understand:

  • How vision-language models turn images into “tokens.”
  • The two main fusion strategies (early fusion vs cross-attention).
  • The role of vision encoders (ViT, SigLIP, CLIP).
  • How audio is similarly tokenized (Whisper, Qwen-Audio).
  • Why multimodal models change the prefill picture so dramatically.
  • The Qwen-VL family and the modern multimodal pattern.

This is Stage 2’s last chapter. By the end you’ll have the full picture of practitioner-level inference internals, ready for Stage 3 (research frontier) starting in Chapter 33.

Outline:

  1. The “tokenize everything” framing.
  2. Vision encoders: ViT, CLIP, SigLIP.
  3. Image tokens and the patch embedding.
  4. Early fusion vs cross-attention.
  5. The vision-language model architecture in modern open models.
  6. Inference cost: why VL prefill is enormous.
  7. Audio tokenization.
  8. Video models.
  9. The serving implications.

32.1 The framing

A transformer takes a sequence of vectors and produces a sequence of vectors. The vectors at the input have to come from somewhere. For text, they come from a learned embedding table indexed by token IDs (Chapter 5). For images, they come from a vision encoder that maps image patches to vectors. For audio, they come from a spectrogram or waveform encoder. The trick is the same: turn the modality into a sequence of vectors that the transformer can process.

The vision encoder is typically pretrained separately (often with contrastive learning) and either frozen or fine-tuned alongside the language model. The audio encoder is the same.

Once you have vectors from the encoder, you concatenate them with the text token embeddings (or interleave them, depending on the architecture) and feed the combined sequence into the transformer. The transformer treats them as just more tokens. The fact that some came from images and some from text is invisible to the attention mechanism.

This is the “tokenize everything” framing. It’s elegant and it’s the basis of every modern multimodal model.

graph LR
  Img[Image] -->|Vision Encoder| ImgTok[Image Tokens]
  Aud[Audio] -->|Audio Encoder| AudTok[Audio Tokens]
  Vid[Video] -->|Frame Encoder| VidTok[Video Tokens]
  Txt[Text] -->|Tokenizer| TxtTok[Text Tokens]
  ImgTok --> Cat[Concatenate]
  AudTok --> Cat
  VidTok --> Cat
  TxtTok --> Cat
  Cat --> LLM[Transformer LLM]
  LLM --> Out[Output Tokens]
  style Cat fill:var(--fig-accent-soft),stroke:var(--fig-accent)
  style LLM fill:var(--fig-surface),stroke:var(--fig-border)

Every modality is independently encoded into a sequence of vectors, then concatenated into a single input sequence — the transformer itself is modality-agnostic.

32.2 Vision encoders — ViT, CLIP, SigLIP

The dominant vision encoder family is the Vision Transformer (ViT). ViT (Dosovitskiy et al., 2020) was the paper that showed transformers could replace CNNs for image classification. The architecture:

  1. Split the input image into fixed-size patches (e.g., 16×16 pixels each).
  2. Flatten each patch into a vector and apply a learned linear projection to get a patch embedding.
  3. Add a learned positional embedding to each patch.
  4. Run a standard transformer encoder over the sequence of patch embeddings.
  5. The output is one vector per patch, plus a “[CLS]” vector if BERT-style pooling is used.

For a 224×224 image with 16×16 patches, you get (224/16)² = 196 patches → 196 vector “tokens” feeding into the transformer.

ViT was the architecture; CLIP (Radford et al., 2021) was the training objective that made vision encoders broadly useful. CLIP trained a ViT image encoder and a text encoder jointly using a contrastive loss: image-caption pairs are pulled together in embedding space, mismatched pairs are pushed apart. The result is a vision encoder that produces embeddings that are aligned with text embeddings — you can take an image, embed it with CLIP, and the resulting vector lives in roughly the same space as the embedding of the image’s caption.

This alignment is the foundation of multimodal LLMs. A vision encoder pretrained on CLIP-style contrastive learning produces vectors that are “linguistically meaningful” — close to the text descriptions of the same content. When you feed those vectors into an LLM as input tokens, the LLM can immediately use them because they’re in a familiar embedding space.

SigLIP (Zhai et al., 2023) is a refinement of CLIP that uses a sigmoid loss instead of softmax. It’s faster to train, scales better, and produces slightly better embeddings. SigLIP is the vision encoder in many modern multimodal models including PaliGemma and the Qwen-VL family.
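The difference between the two objectives fits in a few lines. Here is a toy numpy sketch (random, untrained embeddings; function names are illustrative): CLIP normalizes similarities across the whole batch with a softmax, so every image must pick out its own caption among all captions in the batch, while SigLIP scores each (image, text) pair as an independent binary decision.

```python
import numpy as np

def clip_loss(img, txt, temp=0.07):
    """CLIP-style softmax contrastive loss over a batch of paired embeddings."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temp                       # (B, B) similarity matrix
    labels = np.arange(len(img))
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[labels, labels].mean()          # softmax over the whole batch

def siglip_loss(img, txt, temp=0.07):
    """SigLIP-style sigmoid loss: each (image, text) pair is an independent
    binary decision -- no batch-wide softmax normalization."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temp
    targets = 2 * np.eye(len(img)) - 1                # +1 on diagonal, -1 elsewhere
    return np.log1p(np.exp(-targets * logits)).mean() # binary cross-entropy in logit form

rng = np.random.default_rng(0)
img, txt = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(clip_loss(img, txt), siglip_loss(img, txt))
```

The sigmoid form is why SigLIP scales better: the loss for each pair no longer depends on every other pair in the (potentially huge) batch.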

32.3 Image tokens and the patch embedding

Walk through how an image becomes “tokens” for a multimodal LLM. For a typical setup with a 224×224 image and a 16×16 patch size:

  1. Patchify. Split the image into 14 × 14 = 196 patches of 16 × 16 × 3 pixels each (3 channels for RGB).
  2. Flatten and project. Each patch becomes a vector of 16 × 16 × 3 = 768 raw pixels, then a learned linear layer maps it to the model’s hidden dimension (e.g., 1024 for SigLIP-Large).
  3. Add positional embeddings. Each of the 196 patches gets a learned position vector added to it, encoding its 2D location in the image.
  4. Run through ViT. The 196 patch vectors go through a transformer encoder (typically 12-24 layers), producing 196 contextualized vectors as output.
  5. Optional: project to LLM hidden dim. A small linear projection (sometimes a 2-layer MLP) maps each ViT output vector to the LLM’s hidden dimension.
  6. Feed into the LLM. The 196 vectors are inserted into the LLM’s input sequence as “image tokens,” replacing a placeholder <image> token in the text prompt.

After this, the LLM sees [text_tokens, image_tokens, more_text_tokens] and treats them all uniformly. Attention can flow between image tokens and text tokens; the model learns to use the image content to inform its text generation.
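Steps 1–3 and 5 above are plain tensor manipulation. A minimal numpy sketch with random (untrained) weights, omitting the ViT encoder layers of step 4:

```python
import numpy as np

def image_to_llm_tokens(image, patch=16, d_vit=1024, d_llm=4096, seed=0):
    """Sketch of the image-token pipeline with random weights:
    patchify -> flatten + project -> add positions -> project to LLM dim.
    The ViT encoder layers between steps 3 and 5 are omitted for brevity."""
    H, W, C = image.shape
    rng = np.random.default_rng(seed)
    n_h, n_w = H // patch, W // patch
    # 1. patchify: (H, W, C) -> (n_patches, patch*patch*C)
    patches = (image.reshape(n_h, patch, n_w, patch, C)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(n_h * n_w, patch * patch * C))
    # 2. flatten + learned linear projection to the vision hidden dim
    W_embed = rng.normal(size=(patches.shape[1], d_vit)) * 0.02
    x = patches @ W_embed
    # 3. learned 2D positional embeddings, one per patch
    x = x + rng.normal(size=(n_h * n_w, d_vit)) * 0.02
    # 4. (ViT encoder layers would run here)
    # 5. projector to the LLM hidden dimension
    W_proj = rng.normal(size=(d_vit, d_llm)) * 0.02
    return x @ W_proj

tokens = image_to_llm_tokens(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 4096): 196 image tokens in the LLM's hidden dim
```

The shapes are the whole story: a 224×224 image becomes exactly (224/16)² = 196 rows of the LLM's input matrix.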

[Figure: image-to-token pipeline. ① raw image (224×224 px → 196 patches) → ② patch embed (flatten 16×16×3 = 768, linear → (196, D)) → ③ ViT encoder (12–24 layers of self-attention over patches → (196, D_vit)) → ④ projector (1–2 layer MLP, D_vit → D_llm) → ⑤ LLM input (196 image-token vectors + text tokens). For a 1024×1024 image with patch 16: (1024/16)² = 4096 image tokens; dynamic tiling (Qwen-VL) can exceed 16,000 tokens per image.]
An image becomes LLM input tokens through five steps — patchify, project, position-encode, ViT-encode, project to LLM dim — and the token count scales quadratically with resolution.

For higher-resolution images (e.g., 1024×1024), the number of patches is (1024/16)² = 4096. One image becomes 4096 input tokens. This is the source of the prefill cost explosion we’ll discuss in §32.6.

Modern VL models often use dynamic resolution: the image is split into multiple “tiles” of standard size, plus a smaller “thumbnail” of the whole image. A 2048×2048 image might become 16 tiles of 512×512 plus one 256×256 thumbnail, producing 16 × 1024 + 256 = 16,640 image tokens. This is the Qwen-VL approach and is why those models handle complex images well — but it also explains why their prefill is enormous.
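The tiling arithmetic is worth making explicit. A hypothetical helper in the style described above (real Qwen-VL tiling differs in details such as aspect-ratio handling):

```python
import math

def tiled_token_count(img_h, img_w, tile=512, thumb=256, patch=16):
    """Dynamic-tiling token budget: full-resolution tiles plus one global
    thumbnail, each patchified at `patch` pixels per side."""
    n_tiles = math.ceil(img_h / tile) * math.ceil(img_w / tile)
    tokens_per_tile = (tile // patch) ** 2
    thumb_tokens = (thumb // patch) ** 2
    return n_tiles * tokens_per_tile + thumb_tokens

print(tiled_token_count(2048, 2048))  # 16 tiles * 1024 + 256 = 16640
print(tiled_token_count(1024, 1024))  # 4 tiles * 1024 + 256 = 4352
```

Token count grows with the square of the resolution, which is exactly the prefill-cost explosion discussed in §32.6.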

32.4 Early fusion vs cross-attention

Two main architectures for combining vision and language inside the transformer:

Early fusion (the modern default)

Image tokens and text tokens are concatenated into one sequence and processed by the same transformer stack. There’s no architectural distinction between modalities — the transformer learns to handle both. This is what LLaVA, the Qwen-VL family, and most modern open multimodal models do.

The pros:

  • Simple. No special architecture, just longer input sequences.
  • Composable. Easy to add new modalities — just plug in a new encoder.
  • Same architecture for training and inference. No special code paths.

The cons:

  • Expensive prefill. All those image tokens have to go through every layer. We’ll quantify this.
  • No modality-specific specialization. The model has to learn to handle both modalities in the same parameters.

Cross-attention (the older approach)

The transformer has two parallel paths: one for text, one for images. The text path is the standard transformer; the image path is the vision encoder. Cross-attention layers in the text path “look at” the image features without merging them into the same sequence.

This is the approach used in Flamingo (Alayrac et al., 2022), IDEFICS, and Llama 3.2 Vision. It’s more parameter-efficient because the image features enter only through the cross-attention layers, not through every layer’s input sequence.

The cons of cross-attention are that it’s more complex (two paths, special layers) and that the modern open community has standardized on early fusion. As of late 2025, early fusion is the dominant approach and what every new model defaults to.

[Figure: early fusion (modern default) concatenates vision-encoder outputs and text tokens into a single LLM stack: simple, no special architecture, easy to add new modalities, but expensive prefill since all tokens pass through every layer. Cross-attention (older, Flamingo) keeps a text-only LLM whose layers attend to vision features via cross-attention: parameter-efficient, but more complex code paths and not the open-source standard in 2025.]
Early fusion is architecturally simpler and the modern default — cross-attention is more parameter-efficient but requires special layers and has lost community momentum.

32.5 The modern multimodal architecture

Concretely, the architecture of a modern open multimodal LLM (e.g., Qwen2.5-VL, LLaVA, Llama 3.2 Vision):

[Image] → [Vision Encoder (ViT/SigLIP)] → [Image Features (N_image, D_vision)]
                                                  |
                                                  V
                                          [Projector (MLP)]
                                                  |
                                                  V
                                          [Image Tokens (N_image, D_llm)]
                                                  |
[Text Prompt] → [Tokenizer] → [Text Tokens (N_text, D_llm)]
                                                  |
                                                  V
                              [Concat: text + image + text] → [LLM]
                                                                 |
                                                                 V
                                                          [Output Tokens]

The pieces:

  • Vision encoder: ViT-based, often SigLIP. Pretrained, sometimes fine-tuned during VL training. Produces 200-2000+ “image tokens” depending on resolution.
  • Projector: a small MLP (1-2 layers) that maps from the vision encoder’s hidden dim to the LLM’s hidden dim. Trained from scratch during VL training.
  • LLM: a standard text-only LLM, often unchanged architecturally. Sometimes fine-tuned to handle the new “image token” inputs.
  • Tokenizer: same as the underlying LLM, with special tokens added for <image> placeholders.

The prompt format for a VL model looks like:

<image>
What is in this image?

The <image> is a special token that the runtime replaces with the actual image features at inference time. The user provides the image as a separate input; the runtime encodes it through the vision encoder and projector and inserts the result into the input sequence at the position of the <image> placeholder.
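The splice itself is simple bookkeeping. A minimal sketch, assuming a hypothetical placeholder id (`IMAGE_PLACEHOLDER`) and toy dimensions; real runtimes do the same thing with batched tensors:

```python
import numpy as np

IMAGE_PLACEHOLDER = -1   # hypothetical id the tokenizer assigns to <image>

def build_input_embeddings(token_ids, embed_table, image_features):
    """Replace the <image> placeholder with the projected image-token vectors,
    yielding the embedding sequence the LLM actually sees."""
    rows = []
    for tid in token_ids:
        if tid == IMAGE_PLACEHOLDER:
            rows.append(image_features)           # (N_image, D_llm) block
        else:
            rows.append(embed_table[tid][None])   # (1, D_llm) text embedding
    return np.concatenate(rows, axis=0)

vocab, d = 100, 8
table = np.random.default_rng(0).normal(size=(vocab, d))
img_feats = np.zeros((196, d))                    # from encoder + projector
seq = build_input_embeddings([5, IMAGE_PLACEHOLDER, 9, 12], table, img_feats)
print(seq.shape)  # (199, 8): 3 text tokens + 196 image tokens
```

From the LLM's point of view there is no difference between the 196 spliced-in rows and the three text rows around them.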

32.6 Inference cost — why VL prefill is huge

This is the punchline of the chapter and the reason multimodal serving is qualitatively different from text-only serving.

For a text-only chat request, the prompt is typically 100-2000 tokens. Prefill is fast (sub-second on H100). Decode dominates the user-perceived latency for long generations.

For a vision-language request with a single 1024×1024 image, the prompt becomes:

  • Text portion: ~100 tokens
  • Image portion: ~1000-2000 tokens (depending on the model’s tokenization scheme)
  • Total: ~1100-2100 tokens

For a request with multiple images or high-resolution input, the image token count can balloon to 5000-20000+ tokens. A document VQA task with multiple pages might process 30k+ image tokens.

Prefill compute scales linearly with input length (Chapter 21). A VL request with 10k image tokens has 10× the prefill cost of a text-only request with 1k tokens. TTFT goes up dramatically.

Concrete numbers for Qwen2.5-VL 7B serving on an H100:

Request type                     | Input tokens | TTFT
Text-only (500 tokens)           | 500          | ~0.3 s
VL with one 512×512 image        | ~1,100       | ~0.6 s
VL with one 1024×1024 image      | ~3,000       | ~1.5 s
VL with multi-image (4 images)   | ~12,000      | ~6 s
Document VQA (5-page doc)        | ~30,000      | ~15 s

A document-level VL request can take 15+ seconds just for prefill before the first response token is emitted. This is much worse than text-only serving and changes the latency picture entirely.
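The table implies a simple linear model: roughly 0.5 ms of prefill per input token on this setup. A back-of-envelope estimator under that assumption (it ignores the vision-encoder forward pass and batching effects):

```python
def estimate_ttft(text_tokens, image_tokens, ms_per_token=0.5):
    """Rough TTFT estimate assuming prefill scales linearly with input length.
    ms_per_token ~0.5 is fitted from the Qwen2.5-VL 7B / H100 table above."""
    total = text_tokens + image_tokens
    return total * ms_per_token / 1000.0   # seconds

print(estimate_ttft(100, 30_000))  # ~15 s for a 5-page document VQA request
```

The estimator makes the design point concrete: image tokens dominate the sum, so TTFT is effectively a function of the image content alone.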

[Figure: TTFT vs. input size: text-only (500 tok) 0.3 s; VL 512×512 (1.1k tok) 0.6 s; VL 1024×1024 (3k tok) 1.5 s; multi-image ×4 (12k tok) 6 s; doc VQA 5-page (30k tok) 15 s.]
TTFT scales linearly with total input token count — a 30k-token document VQA request has 50× the TTFT of a text-only chat request, shifting the bottleneck from decode to prefill.

The decode cost is also higher because the KV cache is bigger:

KV cache for 30k tokens × 320 KB/token (Llama 3 70B) = ~9.6 GB per request

That’s a huge KV cache for one request. If you’re serving multiple users with multimodal inputs, the KV cache memory pressure is intense.
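The 320 KB/token figure follows directly from the Llama 3 70B shape (80 layers, 8 KV heads under GQA, head dim 128, fp16 — two bytes per element, K and V both cached):

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """KV cache bytes per token: one K and one V vector at every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

# Llama 3 70B: 80 layers, 8 KV heads (GQA), head_dim 128, fp16
per_tok = kv_bytes_per_token(80, 8, 128)
print(per_tok // 1024)              # 320 KB per token
print(30_000 * per_tok // 1024 / 1e6)  # ~9.6 GB for a 30k-token request (KB/GB as above)
```

Note that GQA already shrinks this 8× relative to full multi-head attention (64 query heads but only 8 KV heads); without it, a single 30k-token VL request would need ~77 GB of KV cache.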

The implications for serving:

(1) VL workloads are prefill-bound. TTFT dominates user-perceived latency. Optimizing prefill (chunked prefill, prefix caching where applicable) is the priority.

(2) Disaggregated serving wins big. Recall from Chapter 21 the prefill/decode asymmetry. For VL with 1000+ prefill tokens per image, the asymmetry is extreme. Disaggregating prefill onto separate GPUs (Chapter 36) gives a much bigger payoff for VL than for text-only — the DistServe and Splitwise papers show this clearly, and production benchmarks report 30–50% per-GPU throughput gains from disaggregation on VL workloads, versus roughly none for short text workloads.

(3) Context length budgets are different. An “8k-context” VL model can fit only a few images plus a small text prompt. You need much longer context (32k+) for serious multi-image use cases.

(4) Caching is harder. Image tokens are unique per image — no two images produce the same token sequence. Prefix caching helps for the system prompt but not for the image content. RAG-style caching of common images can help if the workload has known recurring images.

(5) Hardware preferences shift. VL workloads are more compute-bound (because of the heavy prefill) than text-only. Compute throughput matters more relative to HBM bandwidth.

32.7 Audio tokenization

Audio is processed similarly: turn it into a sequence of tokens, feed it through the transformer.

The two main approaches:

Spectrogram-based encoders (Whisper-style)

Whisper (OpenAI, 2022) encodes audio by:

  1. Convert the waveform to a log-mel spectrogram (a time-frequency representation).
  2. Run a convolutional projection to get patch-like vectors.
  3. Run a transformer encoder over the spectrogram patches.
  4. The output is contextualized vectors that downstream layers can use.

Whisper’s encoder produces ~50 vectors per second of audio. A 30-second clip becomes ~1500 audio tokens. For multimodal LLMs that take audio input, those 1500 tokens are inserted into the LLM’s input sequence (similar to how image tokens are inserted).
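The ~50 tokens/second figure comes from two downsampling steps: the log-mel spectrogram uses a 160-sample hop at 16 kHz (100 frames per second), and the encoder's convolutional stem halves the time axis. A quick sketch of the arithmetic:

```python
def whisper_audio_tokens(seconds, sample_rate=16_000, hop=160, conv_stride=2):
    """Audio tokens per clip, Whisper-style: 160-sample hop gives 100 mel
    frames/sec at 16 kHz; the conv stem downsamples time by 2 -> ~50 tok/sec."""
    mel_frames = seconds * sample_rate // hop
    return mel_frames // conv_stride

print(whisper_audio_tokens(30))  # 1500 tokens for a 30-second clip
print(whisper_audio_tokens(1))   # 50 tokens per second of audio
```

Compare with images: 30 seconds of speech costs about as many tokens as a single 512×512 image tile plus a thumbnail.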

Whisper itself was trained as an encoder-decoder for speech recognition. The encoder is frequently reused as an audio encoder for multimodal LLMs (e.g., Qwen-Audio), where it provides the “audio understanding” component.

Waveform-based encoders (HuBERT, w2v-BERT)

A different approach: encode the raw waveform directly without going through a spectrogram. HuBERT (Hsu et al., 2021) and wav2vec 2.0 are the canonical examples. They use convolutional layers + transformers operating on the raw 16kHz audio.

These produce roughly the same number of tokens per second as Whisper but are more flexible for non-speech audio.

For modern multimodal LLMs, Whisper-style encoding is more common because Whisper is mature, widely available, and produces high-quality speech embeddings.

[Figure: audio tokenization pipeline. ① waveform (16 kHz PCM) → ② log-mel spectrogram (80 mel bins × T) → ③ conv projection (1D conv with temporal stride) → ④ transformer encoder (Whisper) → ⑤ audio tokens (~50 tok/sec → LLM input). 30 seconds of audio → ~1,500 audio tokens inserted into the LLM sequence; the same "tokenize then concatenate" pipeline as for images.]
Audio follows the same tokenize-and-concatenate pattern as images — 30 seconds of speech becomes ~1,500 audio tokens via Whisper's spectrogram encoder, then slots into the LLM's input sequence.

Audio output

The reverse direction — generating audio from a model — is harder. The dominant approach as of 2025 is discrete audio tokens: discretize audio into a small vocabulary (using vector quantization or RVQ), train a transformer to predict next audio token, decode the tokens back to waveform with a separate decoder. This is how SoundStorm, Bark, and similar TTS models work.

The full closed-loop “audio in, audio out” model is just starting to get traction in 2025. Models like GPT-4o (Voice mode), Moshi, and Mini-Omni are early examples. They typically run audio encoder → LLM → audio decoder, with the LLM in the middle producing both text-like and audio-token-like outputs.

32.8 Video models

Video is image+time. The standard approach: sample frames from the video at some rate, encode each frame with a vision encoder, and treat the resulting sequence of frame-embeddings as a long input sequence to the LLM.

A 1-minute video at 1 frame/sec gives 60 frames × ~256 tokens/frame = 15,360 video tokens for a single minute of video. Long videos quickly exceed any reasonable context window.

Modern video models (Qwen2.5-VL with video, GPT-4V with video, Gemini’s video understanding) use various tricks to compress this:

  • Lower frame rate. Sample at 0.5 or 0.25 fps for long videos.
  • Token reduction. Pool image tokens to fewer per frame (e.g., 64 instead of 256).
  • Temporal compression. Use a 3D vision encoder that processes spatiotemporal patches instead of per-frame patches.

Video understanding is currently the most compute-hungry multimodal use case. The serving cost per request can be 10-100× the cost of a text-only request. Production deployments are mostly for short clips (under 1 minute) until the techniques mature further.
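The token budgets above can be sketched in a few lines; a naive per-frame counter makes clear how much the frame-rate and token-reduction tricks buy:

```python
def video_tokens(duration_s, fps=1.0, tokens_per_frame=256):
    """Naive per-frame video token budget, before any temporal compression."""
    return int(duration_s * fps) * tokens_per_frame

# One minute at 1 fps, 256 tokens/frame -- already a long-context request:
print(video_tokens(60))                                  # 15360
# The compression tricks above, applied together (0.25 fps, 64 tok/frame):
print(video_tokens(60, fps=0.25, tokens_per_frame=64))   # 960
```

A 16× reduction from two simple knobs — which is why production video serving leans so heavily on sampling rate and per-frame pooling before touching the architecture.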

32.9 The serving implications

The high-level summary of how multimodal changes serving:

(1) Prefill costs balloon. A single image is 200-2000 tokens of prefill. Multi-image and document workloads can exceed 30k tokens of prefill per request.

(2) TTFT dominates user-perceived latency. Where text-only chat is decode-bound (TPOT matters most), VL is prefill-bound (TTFT matters most).

(3) KV cache pressure is higher. Image tokens contribute to the KV cache the same as text tokens, but there are many more of them.

(4) Disaggregated serving wins more. The prefill/decode asymmetry is extreme for VL, making disaggregation a clearer cost win.

(5) Hardware preferences shift. VL is more compute-bound; text-only is more memory-bound. The optimal hardware for each is different.

(6) Caching is harder. Image tokens are unique per image. Prefix caching helps for the system prompt, not for image content.

(7) Context length matters more. A “32k context” model only handles a few high-resolution images. Long-context capability is more important for VL than for text-only.

(8) The tokenizer-equivalent for images is brittle. The image-to-tokens pipeline (vision encoder + projector) is tightly coupled to the training setup: changing the resolution, patch size, or projector after training breaks the model’s image understanding. Pin everything.

The skill for serving multimodal at scale is recognizing that it’s not just text serving plus more tokens. It’s a different workload with different bottlenecks, different SLOs, and different hardware preferences. The next chapter (Chapter 33) starts the research-frontier stage of Part III, but multimodal is one of the places where research and production overlap most heavily.

32.10 The mental model

Eight points to take into Chapter 33:

  1. Multimodal works by tokenizing every modality. Images become patch tokens, audio becomes spectrogram tokens, video becomes frame tokens.
  2. Vision encoders (ViT/CLIP/SigLIP) produce embeddings aligned with text. The LLM treats them as just more tokens.
  3. Early fusion is the modern default. Image tokens and text tokens go through the same transformer.
  4. A single image is 200-2000+ tokens. High-resolution and multi-image inputs can exceed 30k tokens.
  5. Prefill explodes for VL. TTFT dominates instead of TPOT.
  6. KV cache pressure is higher because of the large number of image tokens.
  7. Disaggregated serving wins big for VL because of the extreme prefill/decode asymmetry.
  8. Audio and video are similar. Encoder produces tokens; transformer processes them; same architecture.

In Chapter 33 we open Stage 3 (research frontier) with the attention compression family — the architectural changes that shrink the KV cache.


Read it yourself

  • Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT, 2020).
  • Radford et al., Learning Transferable Visual Models From Natural Language Supervision (CLIP, 2021).
  • Zhai et al., Sigmoid Loss for Language Image Pre-Training (SigLIP, 2023).
  • Liu et al., Visual Instruction Tuning (LLaVA, 2023).
  • Bai et al., Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond (2023).
  • Wang et al., Qwen2.5-VL Technical Report (2025).
  • The Whisper paper (Radford et al., 2022).
  • The SmolVLM paper for an efficient multimodal architecture.

Practice

  1. Compute the number of image tokens for a 1024×1024 image with patch size 14 (used by SigLIP-Large). What about 224×224?
  2. Why is early fusion preferred over cross-attention in modern open VL models? Argue both sides.
  3. For a Qwen2.5-VL 7B serving deployment, estimate the TTFT for a request with 5 high-resolution images. Use the §32.6 numbers.
  4. Why does disaggregated serving help more for VL than for text-only? Trace the prefill/decode cost for both.
  5. The KV cache for a single 30k-token VL request on Llama 3 70B is ~9.6 GB. How many such requests can run concurrently on a 2×H100 setup?
  6. Why is video understanding an order of magnitude more compute-hungry than image understanding? Compute the per-second cost.
  7. Stretch: Run a small open VL model (e.g., Qwen2.5-VL 3B) on a few images and measure TTFT for each. Compare with the same model serving text-only requests.

Concept check

  1. A 336×336 image is processed by a ViT with 14×14 patches. How many image tokens does this produce, and why does this dramatically increase TTFT?
  2. What is the key architectural difference between early fusion and cross-attention fusion in vision-language models?
  3. Why does disaggregated prefill-decode (Chapter 36) provide a larger benefit for vision-language workloads than for text-only workloads?
  4. A model is asked to process a video with 30 frames at 1 FPS over 30 seconds, each frame producing 256 tokens. What serving challenge does this create that a text-only workload does not?