Part III · Inference Internals & Production Serving
Chapter 30 Intermediate ~20 min read

Cost modeling for inference: tokens, GPUs, dollars

"How much does it cost to serve a 70B model? It depends. How much does it depend on? Mostly on how you batch"

In Chapters 21–29 we built up the technical understanding of how LLM inference works, why it has the bottlenecks it does, and which optimizations apply. In this chapter we put dollar signs on it. By the end you will know:

  • How to estimate the per-million-token cost of serving any model on any hardware.
  • The capacity planning math interviewers expect from a senior candidate.
  • Why the answer to “is it cheaper to self-host or use the API?” depends entirely on utilization.
  • How to compare hardware (H100 vs H200 vs B200) on a $-per-token basis.
  • The cost-quality-latency frontier for serving choices.

This chapter is the most “interview-favorite” of Part III. The capacity-planning math here is asked in every senior ML systems interview, in some form, and the candidates who can do it cold separate themselves from the rest.

Outline:

  1. The cost model.
  2. Hardware costs: cloud vs reserved vs amortized.
  3. The maximum throughput formula.
  4. Real-world utilization vs peak.
  5. Per-million-token cost in the limit.
  6. Compute-bound vs memory-bound cost.
  7. Self-hosting vs API: when each wins.
  8. Comparing hardware: H100 vs H200 vs B200.
  9. Quality vs cost vs latency trade-offs.
  10. The capacity planning rubric.

30.1 The cost model

The basic cost model for serving LLMs:

$/million_tokens = (hardware_cost_per_hour) / (tokens_per_second × 3600 / 1_000_000)
                 = (hardware_cost_per_hour × 1_000_000) / tokens_per_hour

Or, more practically:

$/million_tokens = (cost_per_hour) / (millions_of_tokens_per_hour)

You need three numbers to compute this:

  1. Hardware cost per hour. Cloud rates: H100 ~$2.5–$4/hour on-demand, ~$1.5–$2/hour reserved. Bare-metal or capex-amortized: ~$1/hour or less.
  2. Tokens per second the hardware can produce. This is the throughput, and it depends on the model, the batching, the workload, and the optimizations applied.
  3. Conversion to per-million-tokens. Just unit math.

For a 70B model on a single H100 at $3/hour, producing 1000 tokens/sec total throughput:

1000 tokens/sec × 3600 sec/hour = 3.6 million tokens/hour
$3 / hour ÷ 3.6 million tokens/hour = $0.83 per million tokens

So roughly $0.83 per million tokens of output for 70B Llama on a single H100. Compare to OpenAI’s GPT-4o at ~$10 per million output tokens, or Claude Sonnet at ~$15 per million. Self-hosting is 10–20× cheaper per token if you can keep the GPU at high utilization.

The catch is that “can keep at high utilization” is the hard part. We’ll get to that in §30.4.
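The unit math is small enough to script as a sanity check. A minimal sketch (the function name is mine, not a standard API):

```python
def cost_per_million_tokens(cost_per_hour: float, tokens_per_sec: float) -> float:
    """$/M tokens = hourly hardware cost / millions of tokens produced per hour."""
    millions_per_hour = tokens_per_sec * 3600 / 1_000_000
    return cost_per_hour / millions_per_hour

# The worked example above: 70B on one H100 at $3/hour sustaining 1000 tok/s.
print(round(cost_per_million_tokens(3.0, 1000), 2))  # → 0.83
```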

[Figure: the cost model as a ratio. Hardware cost per hour (e.g. $2.00/hr, H100 reserved) ÷ tokens per hour in millions (1000 tok/s × 3600 / 1M = 3.6) gives $2 ÷ 3.6 ≈ $0.56/M tokens. Higher throughput lowers cost; lower utilization raises it: at 50% utilization, tokens/hr halves and cost doubles to ~$1.11/M. Utilization is the primary cost lever; hardware price is fixed.]
Cost per million tokens is a simple ratio, but utilization is the variable that moves it 2–5× in practice — a GPU sitting idle still costs the same per hour.

30.2 Hardware costs

The cost-per-hour of a GPU depends on how you procure it.

Cloud on-demand. Pay-as-you-go. The most expensive option but the most flexible. Late 2025 rates:

GPU             On-demand $/hour
A100 (40GB)     ~$1.50
A100 (80GB)     ~$2.50
H100            ~$3.00–$4.50
H200            ~$4.50–$6.00
B200            ~$8.00–$12.00 (preview pricing)

These are wide ranges because pricing varies by cloud provider, region, and contract terms.

Cloud reserved (1-year, 3-year commits). Roughly half the on-demand price. You commit to paying for the GPU regardless of usage, so it’s cheaper if you can fill it.

Spot/preemptible. Half the reserved price (so quarter of on-demand), but the cloud can take the GPU back at any time. Useful for batch workloads that can tolerate interruption; not useful for low-latency serving.

Bare metal (direct lease). Lambda Labs, CoreWeave, Crusoe, Voltage Park. Often 30–50% cheaper than the major clouds for the same hardware. Less integration with cloud services.

Self-owned (capex). A new H100 SXM GPU is roughly $30,000-$40,000. Amortized over 3 years at the typical depreciation schedule, plus power, cooling, and operations, the effective cost is roughly $1.50–$2/hour. The cheapest option but requires upfront capital and operational expertise.

For most companies, reserved cloud at ~$2/hour is the realistic number for cost calculations. Spot is unreliable for serving; on-demand is too expensive for sustained workloads; capex is for hyperscalers.

For the rest of this chapter I’ll use $2/hour per H100 as the default to make the numbers concrete.

30.3 The maximum throughput formula

For a memory-bandwidth-bound workload (which decode is — Chapter 21), the theoretical maximum decode throughput is:

max_decode_tokens_per_sec = (HBM_bandwidth) / (model_size_bytes) × batch_size

For a 70B model in bf16 (140 GB) on H100 (3 TB/s HBM):

max_decode_tokens_per_sec = 3000 GB/s / 140 GB × batch_size
                          = 21.4 × batch_size tokens/sec

So at batch size 1, the maximum decode throughput is ~21 tokens/sec. At batch size 100, it’s ~2140 tokens/sec total (across all 100 requests, ~21 per request). At batch size 256, it’s ~5500 tokens/sec total.

The linear scaling holds until you saturate either the compute (which won’t happen until very large batch sizes for dense decode) or the GPU memory (the KV cache for 256 sequences is significant — Chapter 22).

[Figure: two throughput regimes. Left panel, decode (memory-bound): tokens/sec grows linearly with batch size (BW / model_size × batch), from ~21 tok/s at batch 1 to ~5k at batch 256. Right panel, prefill (compute-bound): throughput is a flat ceiling of ~7k tok/s (peak_TFLOPs / (2 × params)), independent of prompt length.]
Decode throughput scales linearly with batch size (memory-bandwidth-bound) while prefill throughput is a flat ceiling set by peak compute — the two regimes need different formulas.

For prefill (compute-bound at AI ≈ 700), the throughput formula is different:

max_prefill_tokens_per_sec = peak_TFLOPs / (2 × params)

For a 70B model in bf16 on H100 (peak ~989 TFLOP/s):

max_prefill_tokens_per_sec = 989e12 FLOP/s / (2 FLOPs/param × 70e9 params) ≈ 7000 tokens/sec

So a single H100 can prefill ~7000 tokens/sec at theoretical peak. In practice, achievable throughput is more like 30–40% of peak (≈2500 tokens/sec). For a typical 1000-token prefill, that’s ~0.4 sec of prefill time.
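Both ceilings can be expressed in a few lines. A sketch under the chapter's assumptions (bf16 weights, ~2 FLOPs per parameter per token; the function names are illustrative):

```python
def max_decode_tps(hbm_bw_gb_s: float, model_gb: float, batch_size: int) -> float:
    # Memory-bound: each decode step streams the full weights once,
    # amortized across the whole batch.
    return hbm_bw_gb_s / model_gb * batch_size

def max_prefill_tps(peak_flops: float, params: float) -> float:
    # Compute-bound: ~2 FLOPs per parameter per token.
    return peak_flops / (2 * params)

# 70B bf16 (140 GB) on H100: 3 TB/s HBM, ~989 TFLOP/s peak.
print(round(max_decode_tps(3000, 140, 1)))      # → 21
print(round(max_decode_tps(3000, 140, 256)))    # → 5486
print(round(max_prefill_tps(989e12, 70e9)))     # → 7064
```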

30.4 Real-world utilization vs peak

The headline numbers above are theoretical peak. Real serving achieves significantly less. The reasons:

(1) Overhead. The framework (vLLM, etc.) has overhead per step: scheduling, memory allocation, cache lookups, etc. This isn’t free.

(2) MFU/MBU (Model FLOP / Bandwidth Utilization). The compute and memory subsystems aren’t 100% utilized. For prefill, MFU is typically 30-50%. For decode, MBU is typically 60-85%.

(3) Variable batch composition. Real workloads have a mix of prefill and decode tokens in flight. The overall throughput is some weighted average, not the maximum of either.

(4) Cache misses and admission delays. When the KV cache fills, the scheduler has to evict or queue, which costs time.

(5) Empty intervals. When traffic isn’t constant, the GPU has idle moments waiting for new requests. The average utilization is lower than the peak utilization.

The realistic decode throughput for a 70B model on a single H100 with vLLM at high concurrency is ~500–1500 tokens/sec total, depending on batch size and workload mix. That’s significantly lower than the 5500 tokens/sec theoretical peak at batch=256, but it’s the number you actually get.

For cost calculations, you should use the realistic number, not the peak. The peak is what hardware can deliver in a benchmark; the realistic is what your fleet actually delivers in production.

30.5 Per-million-token cost

Putting it together for a 70B model on a single H100 at $2/hour, achieving 1000 tokens/sec realistic throughput:

1000 tokens/sec × 3600 = 3.6M tokens/hour
$2 / 3.6M = $0.56 per million tokens

So the realistic cost is roughly $0.55 per million tokens of output for self-hosted 70B Llama on a single H100. Some refinements:

  • At lower utilization (say 50%), the cost doubles: $1.11 / M tokens. The fixed hardware cost is the same; you’re producing fewer tokens to amortize it over.
  • At higher batch sizes (more concurrency), you might push throughput to 1500-2000 tokens/sec, dropping the cost to ~$0.30 / M tokens.
  • With AWQ INT4 quantization, throughput might be 2-3× higher (because the model is smaller and decodes faster), dropping the cost proportionally.
  • With FP8 on H100, similar 2× speedup over bf16.
  • With TP=2 for lower per-token latency, throughput per replica drops slightly but you serve fewer concurrent requests per replica. Per-million-token cost is similar.

A reasonable cost-per-million-tokens table for 70B serving:

Configuration                             $/M tokens (output)
70B bf16, single H100, low util (30%)     ~$1.50
70B bf16, single H100, high util (80%)    ~$0.50
70B FP8, single H100, high util           ~$0.25
70B INT4, single H100, high util          ~$0.20
70B FP8, single H200, high util           ~$0.20
70B INT4, single H200, high util          ~$0.15
[Figure: cost per million output tokens, self-hosted vs API. Self-hosted: 70B bf16 low util $1.50, 70B bf16 high util $0.50, 70B FP8 high util $0.25, 70B INT4 high util $0.20. API: GPT-4o-mini $0.60, GPT-4o $10.]
At high utilization, self-hosted 70B costs $0.20–$0.50/M tokens versus $10+ for frontier APIs — a 20-50× gap that shrinks only when utilization drops or operational overhead is included.

These are rough numbers but they’re the right order of magnitude. Compare to API prices:

API                                    $/M output tokens
GPT-4o                                 ~$10
Claude Sonnet                          ~$15
Llama 3.3 70B (Together, Fireworks)    ~$0.50–$1
GPT-4o-mini                            ~$0.60
Claude Haiku                           ~$1.25

The “managed Llama 70B” hosting providers (Together, Fireworks, DeepInfra, etc.) charge prices very close to the self-hosted cost. They have the operational scale to keep utilization high, which is the key to their pricing.

30.6 Compute-bound vs memory-bound cost

The cost model splits along the prefill/decode boundary.

Prefill cost. Prefill is compute-bound. The cost per token is roughly:

prefill_cost_per_token = (cost_per_hour) / (TFLOPs × 3600 × utilization / (2 × params))

For a 70B model on H100 at $2/hour with 40% MFU:

= $2 / (989e12 × 3600 × 0.4 / (2 × 70e9))
= $2 / 10.2 million prefill tokens per hour
= $0.20 per million prefill tokens

Decode cost. Decode is memory-bound. The cost per token is roughly:

decode_cost_per_token = (cost_per_hour) / (HBM_bandwidth × 3600 × utilization × batch_size / model_size)

For a 70B model on H100 at $2/hour with 70% MBU at batch=128:

= $2 / (3e12 × 3600 × 0.7 × 128 / 140e9)
= $2 / 6.9M tokens per hour
= $0.29 per million decode tokens

These numbers are similar to each other for a typical workload. If your prompts are much longer than your completions (RAG-heavy), prefill cost dominates; if completions are much longer (reasoning), decode dominates. In the typical chat case (similar prompt and completion lengths), they balance.

OpenAI’s pricing reflects this asymmetry: input tokens are cheaper than output tokens, often by 2-4×. This is because prefill tokens cost the provider less per token than decode tokens (the math above), and they pass the saving through.
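The two per-token cost formulas plug in directly. A sketch (argument names are illustrative; bandwidth in bytes/s, compute in FLOP/s, model size in bytes):

```python
def prefill_cost_per_million(cost_per_hour: float, peak_flops: float,
                             params: float, mfu: float) -> float:
    # Compute-bound: tokens/hour limited by achieved FLOPs at ~2 FLOPs/param/token.
    tokens_per_hour = peak_flops * 3600 * mfu / (2 * params)
    return cost_per_hour / tokens_per_hour * 1e6

def decode_cost_per_million(cost_per_hour: float, hbm_bw: float,
                            model_bytes: float, mbu: float, batch: int) -> float:
    # Memory-bound: tokens/hour limited by achieved bandwidth, scaled by batch.
    tokens_per_hour = hbm_bw * 3600 * mbu * batch / model_bytes
    return cost_per_hour / tokens_per_hour * 1e6

# The worked examples: 70B bf16 on H100 at $2/hour.
print(round(prefill_cost_per_million(2, 989e12, 70e9, 0.4), 2))     # → 0.2
print(round(decode_cost_per_million(2, 3e12, 140e9, 0.7, 128), 2))  # → 0.29
```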

30.7 Self-hosting vs API: when each wins

The question of “should we self-host or use the API?” reduces to a math problem.

Self-host wins when:

  • You have predictable, sustained traffic. The economics depend on keeping utilization high; API providers have aggregate traffic that smooths out idle periods, while you don’t.
  • Your request volume justifies the fixed cost. A single H100 serving 70B costs ~$1500/month at reserved cloud prices. You need enough traffic to amortize that.
  • You want specific model versions or fine-tunes that the API doesn’t offer.
  • You have data sovereignty / privacy requirements that make API calls infeasible.
  • You have operational maturity to run inference fleets (autoscaling, monitoring, on-call, etc.).

API wins when:

  • Your traffic is bursty or unpredictable. The API handles spikes; self-hosting wastes money on idle GPUs.
  • You’re building a prototype and don’t want to manage infra.
  • You want the strongest model available (frontier closed models like GPT-4 and Claude Sonnet are not self-hostable).
  • Your request volume is low. Below ~50 million tokens/month, the API is almost always cheaper than self-hosting.
  • You don’t have operational capacity to run GPU fleets.

The crossover point for 70B Llama 3 is around 100–500 million tokens/month, depending on how good your operations are. Below that, the API is cheaper; above that, self-hosting wins.

[Figure: monthly cost vs tokens/month. Self-hosting is a fixed floor (~$1500/mo) plus a shallow per-token slope; the API ($0.88/M) is a line through the origin. The curves cross in a 100–500M tokens/month zone: below it the API is cheaper, above it self-hosting is.]
Self-hosting has a fixed monthly floor (GPU reservation) plus a shallow per-token slope; API has no fixed cost but a steeper slope — the crossover near 100–500M tokens/month is where the economics flip.

For frontier models (GPT-4, Claude Sonnet 3.5+), you can’t self-host. The choice is just “which API.”
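The comparison can be made concrete with a toy model under illustrative assumptions (reserved H100s at $2/hour sustaining 1000 tok/s each, a hosted-Llama API at $0.88/M; real fleets also pay for redundancy and peak headroom, which raises the self-host floor):

```python
import math

def self_host_monthly(tokens_per_month: float, gpu_hourly: float = 2.0,
                      tps_per_gpu: float = 1000) -> float:
    capacity = tps_per_gpu * 3600 * 24 * 30        # tokens/month per GPU
    gpus = max(1, math.ceil(tokens_per_month / capacity))
    return gpus * gpu_hourly * 24 * 30             # fixed floor: whole GPUs

def api_monthly(tokens_per_month: float, price_per_million: float = 0.88) -> float:
    return tokens_per_month / 1e6 * price_per_million  # pure linear cost

print(self_host_monthly(100e6), api_monthly(100e6))  # low volume: API wins
print(self_host_monthly(5e9), api_monthly(5e9))      # high volume: self-host wins
```

Note the asymmetry the sketch captures: the self-host cost is a step function of whole GPUs, while the API cost starts at zero and grows linearly.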

30.8 Comparing hardware: H100 vs H200 vs B200

The cost-per-token ratios across recent NVIDIA GPUs, for 70B model serving:

GPU          HBM       Bandwidth   Approx $/hour   $/M tokens (decode-heavy)
A100 80GB    80 GB     2 TB/s      $2.50           $0.80
H100 80GB    80 GB     3 TB/s      $2.00           $0.50
H200         141 GB    4.8 TB/s    $3.00           $0.45
B200         192 GB    8 TB/s      $5.00           $0.45

Several observations:

(1) Newer GPUs aren’t always cheaper per token. B200 has twice the bandwidth of H100 but costs 2.5× more. The per-token cost is similar.

(2) H200 is the sweet spot in late 2025 for memory-bound workloads. It pairs H100-class compute with 60% more bandwidth at only ~50% higher price, so bandwidth per dollar improves.

(3) For models that fit in H100’s 80GB, H100 is fine. For 70B in bf16 (which doesn’t fit on one H100), you need H200 or H100×2. H200 is cheaper for this case.

(4) FP8 quantization changes the picture. A 70B model in FP8 fits on a single H100 (70 GB of weights, leaving ~10 GB for the KV cache). At FP8 throughput, the cost-per-token drops dramatically. The hardware choice becomes “H100 with FP8 quantization” rather than “H200 to fit bf16.”

(5) For training, the picture is different — training is much more compute-heavy and the per-FLOP cost matters more. But this chapter is about inference.

30.9 Quality vs cost vs latency

Three trade-offs:

(1) Quality vs cost. A smaller model is cheaper per token but lower quality. The trade-off is workload-specific. For simple tasks (classification, structured extraction), a 7B model is often enough and 10× cheaper than a 70B. For hard reasoning, a 70B might be necessary.

(2) Quality vs latency. A smaller model is faster per token but lower quality. Same trade-off, different metric.

(3) Cost vs latency. Higher concurrency (bigger batches) gives lower cost per token but higher per-request latency (because the larger batch means longer per-step time). At the limit, batching aggressively gives the lowest cost-per-token but unacceptable user-perceived latency.

In production, you tune these to match your use case:

  • Latency-critical interactive chat: smaller batches, lower concurrency, more replicas. Higher cost per token but lower TPOT.
  • Background batch processing: large batches, high concurrency, fewer replicas. Higher TPOT but lower cost per token.
  • Mixed workload: different SLA tiers running at different batch sizes.

The art is in matching the configuration to the workload.

30.10 The capacity planning rubric

The interview question version: “You need to serve 1 million users with an average of 10 chat sessions per user per day, each session producing 500 tokens of output. Size the fleet.”

The math:

  1. Total tokens per day: 1M × 10 × 500 = 5B tokens/day
  2. Tokens per second on average: 5B / 86400 ≈ 58k tokens/sec
  3. Peak (assume 3× average for daytime peak): ~175k tokens/sec
  4. Per H100 throughput (70B bf16, realistic): ~1000 tokens/sec
  5. Required GPUs at peak: 175k / 1000 = 175 H100s
  6. Add 30% headroom for spikes and rolling deployments: ~225 H100s
  7. Cost: 225 × $2/hour × 24 × 30 = ~$324k/month

So roughly $324k/month and 225 H100s to serve 1M users with 70B Llama. In $/M-token terms: $324k ÷ (5B tokens/day × 30 days) × 1M ≈ $2.16 per million tokens. (The cost per token here is higher than the §30.5 number because we sized for peak, not average — off-peak utilization is lower.)

A senior candidate can do this math in their head in two minutes. A mid-level candidate gets stuck on “how do I estimate the per-GPU throughput?” The answer: use the formulas in §30.3 and §30.4, plus the realism factors.

The same math runs in reverse for “how many users can our existing fleet support?” or “if we switch to FP8 quantization, how many GPUs can we cut?” — these are common production planning questions.
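The rubric runs directly as back-of-envelope code (a sketch; the parameter names and defaults are mine, matching the chapter's assumptions):

```python
import math

def size_fleet(users: int, sessions_per_day: float, tokens_per_session: float,
               peak_factor: float = 3.0, tps_per_gpu: float = 1000,
               headroom: float = 1.3, gpu_hourly: float = 2.0):
    daily_tokens = users * sessions_per_day * tokens_per_session
    avg_tps = daily_tokens / 86_400                       # steps 1-2
    peak_tps = avg_tps * peak_factor                      # step 3
    gpus = math.ceil(peak_tps / tps_per_gpu * headroom)   # steps 5-6
    monthly_cost = gpus * gpu_hourly * 24 * 30            # step 7
    return gpus, monthly_cost

# The interview example: 1M users × 10 sessions/day × 500 tokens.
print(size_fleet(1_000_000, 10, 500))  # → (226, 325440.0)
```

The exact ceiling lands at 226 GPUs rather than the rounded ~225 in the prose; the ~$325k/month matches. Running it in reverse (fix `gpus`, solve for `users`) answers the "how many users can our fleet support?" variant.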

30.11 The mental model

Eight points to take into Chapter 31:

  1. Cost = hardware-cost-per-hour / tokens-per-hour. Memorize this.
  2. Decode is memory-bandwidth-bound. Throughput scales with HBM_bandwidth × batch / model_size.
  3. Prefill is compute-bound. Throughput scales with peak_TFLOPs / (2 × params).
  4. Realistic utilization is 30–80% of peak, depending on workload and tooling.
  5. Self-hosting is 10–20× cheaper than the API when you can keep the GPU at high utilization.
  6. The crossover is around 100M–500M tokens/month. Below, API. Above, self-host.
  7. Quantization is the biggest cost lever. FP8 or INT4 cuts cost-per-token by 2–4×.
  8. Capacity planning is back-of-envelope math with the formulas in §30.3 and §30.4.

In Chapter 31 we look at the latency side: tail latency, queueing, and the p99 problem.


Read it yourself

  • Pope et al., Efficiently Scaling Transformer Inference (2022) — section on cost analysis.
  • The vLLM benchmarks in their GitHub repository. Realistic numbers for various models and configurations.
  • The Together AI / Fireworks AI / DeepInfra pricing pages — see what production hosting actually charges.
  • The Anthropic API pricing page for input vs output token cost ratios.
  • Lambda Labs and CoreWeave pricing for bare-metal H100 rates.

Practice

  1. Compute the cost per million output tokens for Llama 3 8B on a single H100, assuming 5000 tokens/sec realistic decode throughput at $2/hour.
  2. Compute the same for Llama 3 405B in FP8 on 8×H100 (TP=8), assuming 800 tokens/sec total throughput at $16/hour.
  3. A workload requires 100 million tokens per day. Should you self-host on H100s or use Together AI’s hosted Llama 3 70B at $0.88/M tokens? Show your math.
  4. You have 8 H100s. How many concurrent users can you serve with 70B Llama at 200ms TPOT? With 500ms TPOT?
  5. Why does FP8 quantization roughly double the cost-effectiveness of decode? Trace through the bandwidth math.
  6. A startup is running a chat app with bursty traffic — peaks of 10M tokens/hour during the day, near zero overnight. Should they self-host or use the API? Justify.
  7. Stretch: Look up the current cloud pricing for H100s at three providers. Compute the cost-per-million-tokens for 70B Llama at each, assuming 1000 tokens/sec realistic throughput. Compare to the prices on Together AI and DeepInfra.

Concept check

  1. A 70B model runs on one H100 at $3/hour and sustains 1000 tokens/sec total throughput. What is the cost per million output tokens?
  2. Why does self-hosting a large model only beat API pricing at high GPU utilization rates?
  3. In the memory-bound decode regime, which hardware parameter most directly determines maximum tokens-per-second for a fixed model?
  4. An H200 has 4.8 TB/s HBM bandwidth versus the H100's 3.35 TB/s. Ignoring all other factors, by approximately what factor does the H200 improve maximum decode throughput for the same model?