Interactive tools

Ten calculators for the numbers interviewers expect you to know cold. Covers inference, training, cost, retrieval, and reliability.

KV cache | Attention O(s^2) | Latency budget | Training memory | LoRA params | Cost per 1M tokens | Quantization savings | Little's Law | Availability | Error budget

KV cache memory calculator

Formula: KV cache bytes = 2 x layers x kv_heads x head_dim x bytes_per_param x sequence_length x batch_size. The leading 2 counts keys and values.

Inputs: Model, Layers, KV heads, Head dim, Bytes/param, Seq length, Batch size

KV cache size

0 GB
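A minimal Python sketch of the formula above. The function name and the example configuration (32 layers, 8 KV heads, head dim 128, bf16, 8,192 tokens, batch 1) are illustrative assumptions, not the calculator's defaults.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, bytes_per_param, seq_len, batch_size):
    """KV cache size in bytes; the leading 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * bytes_per_param * seq_len * batch_size


# Hypothetical config: 32 layers, 8 KV heads, head_dim 128, bf16 (2 bytes),
# 8,192-token sequence, batch size 1.
size = kv_cache_bytes(32, 8, 128, 2, 8192, 1)
print(f"{size / 2**30:.2f} GiB")  # 1.00 GiB
```

Doubling either sequence length or batch size doubles the result, which is why long-context serving is usually bound by KV cache rather than by weights.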

Attention memory scaling

See how the O(s^2) score matrix grows with sequence length.

Sequence length: 512 to 128k (shown at 4,096)

Score matrix: 16M entries (s x s)

Memory (bf16): 32 MB per head

vs FlashAttention: O(s) memory, never materializes the score matrix
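The quadratic growth can be checked with a few lines of Python. This is a sketch with hypothetical function names; it assumes 2 bytes per entry (bf16) and counts one head's s x s score matrix only.

```python
def score_matrix_entries(seq_len):
    """Number of entries in one head's s x s attention score matrix."""
    return seq_len * seq_len


def score_matrix_bytes(seq_len, bytes_per_entry=2):
    """Memory for one head's score matrix; 2 bytes per entry assumes bf16."""
    return score_matrix_entries(seq_len) * bytes_per_entry


for s in (512, 4096, 131072):
    mib = score_matrix_bytes(s) / 2**20
    print(f"s={s}: {score_matrix_entries(s):,} entries, {mib:,.1f} MiB per head")
```

At s = 4,096 this gives 16,777,216 entries and 32 MiB per head; at 128k it reaches 32 GiB per head, which is why the full matrix is not materialized at long context.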

Inference latency budget

Estimate time to first token (TTFT) and total generation time.

Inputs: Input tokens, Output tokens, Prefill tok/s, Decode tok/s, Queue delay (ms)

Queue

Prefill (TTFT)

Decode

Total end-to-end
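One way to sketch the budget in Python. The function name and example numbers are hypothetical, and the sketch assumes the client-observed TTFT includes queue delay on top of prefill, so both values are returned separately.

```python
def latency_budget(input_tokens, output_tokens, prefill_tps, decode_tps, queue_ms=0.0):
    """Break end-to-end latency into queue, prefill, and decode stages (seconds)."""
    queue_s = queue_ms / 1000
    prefill_s = input_tokens / prefill_tps   # time to process the prompt
    decode_s = output_tokens / decode_tps    # time to generate the output
    ttft_s = queue_s + prefill_s             # assumption: TTFT includes queueing
    return {
        "queue_s": queue_s,
        "prefill_s": prefill_s,
        "ttft_s": ttft_s,
        "decode_s": decode_s,
        "total_s": ttft_s + decode_s,
    }


# Hypothetical request: 2,000 input tokens, 500 output tokens,
# 10,000 tok/s prefill, 50 tok/s decode, 100 ms in the queue.
for stage, seconds in latency_budget(2000, 500, 10_000, 50, queue_ms=100).items():
    print(f"{stage}: {seconds:.2f}")
```

In this example decode takes about 10 s of a roughly 10.3 s total, so output length dominates end-to-end time while prompt length and queueing dominate TTFT.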

Training memory estimator

Total training memory = weights + gradients + optimizer state + activations. Ch 4, 12.

Inputs: Parameters (B), Precision, Optimizer
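A rough Python sketch of the first three terms. The byte counts are common rules of thumb assumed here, not values taken from the calculator: gradients stored in the training precision, fp32 optimizer states (two for AdamW, one for SGD with momentum), and an optional fp32 master copy of the weights for mixed precision. Activations are left out because they depend on batch size, sequence length, and checkpointing.

```python
# Assumed bytes per parameter; these tables are illustrative, not authoritative.
PRECISION_BYTES = {"fp32": 4, "bf16": 2, "fp16": 2}
OPTIMIZER_STATE_BYTES = {"adamw": 8, "sgd_momentum": 4, "sgd": 0}


def training_memory_gb(params_b, precision="bf16", optimizer="adamw", master_weights=True):
    """Weights + gradients + optimizer state in GB (10^9 bytes), excluding activations."""
    weights = PRECISION_BYTES[precision]
    grads = PRECISION_BYTES[precision]
    opt_state = OPTIMIZER_STATE_BYTES[optimizer]
    # Mixed-precision training often keeps an extra fp32 copy of the weights.
    master = 4 if (master_weights and precision != "fp32") else 0
    bytes_per_param = weights + grads + opt_state + master
    return params_b * bytes_per_param  # billions of params x bytes/param = GB


print(training_memory_gb(7))                   # 7B, bf16 + AdamW: 112
print(training_memory_gb(7, optimizer="sgd"))  # 7B, bf16 + plain SGD: 56
```

Under these assumptions bf16 with AdamW costs 16 bytes per parameter, so a 7B model needs about 112 GB before any activations are counted.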

LoRA parameter calculator

Trainable params = (d_in + d_out) x r x num_adapted_layers. Ch 15.

Inputs: d_model, Rank (r), Adapted layers, Full params (B)
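The formula translates directly to Python. In this sketch the function name and example values are hypothetical, and "adapted layers" is read as the number of weight matrices that receive an adapter, each square with d_in = d_out = d_model.

```python
def lora_trainable_params(d_in, d_out, rank, num_adapted_layers):
    """Each adapter adds A (d_in x r) and B (r x d_out): (d_in + d_out) * r params."""
    return (d_in + d_out) * rank * num_adapted_layers


# Hypothetical setup: d_model 4,096, rank 16, 64 adapted matrices, 7B base model.
trainable = lora_trainable_params(4096, 4096, 16, 64)
full_params = 7e9
print(f"{trainable:,} trainable params")  # 8,388,608 trainable params
print(f"{trainable / full_params:.3%} of the full model")
```

Trainable parameters scale linearly with rank, so doubling r doubles adapter size while the base model stays frozen.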

Cost per 1M tokens

Self-hosted vs API breakeven. Ch 30.

Inputs: GPU $/hr, Decode tok/s, Utilization %, API $/1M tokens
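A sketch of the breakeven arithmetic in Python. Function names and prices are hypothetical, and the sketch assumes decode tok/s is the aggregate throughput of one GPU across all batched requests and that GPU rental is the only self-hosting cost.

```python
def self_hosted_cost_per_1m(gpu_per_hr, decode_tps, utilization_pct):
    """Dollars per 1M generated tokens on a GPU that is busy utilization_pct of the time."""
    tokens_per_hr = decode_tps * 3600 * utilization_pct / 100
    return gpu_per_hr / tokens_per_hr * 1_000_000


def breakeven_utilization_pct(gpu_per_hr, decode_tps, api_per_1m):
    """Utilization at which self-hosting costs the same per token as the API."""
    cost_at_full_utilization = gpu_per_hr / (decode_tps * 3600) * 1_000_000
    return cost_at_full_utilization / api_per_1m * 100


# Hypothetical numbers: $2/hr GPU, 1,000 tok/s, 50% utilization, API at $1 per 1M tokens.
print(f"${self_hosted_cost_per_1m(2.0, 1000, 50):.2f} per 1M tokens")  # $1.11
print(f"{breakeven_utilization_pct(2.0, 1000, 1.0):.1f}% breakeven")   # 55.6%
```

A breakeven above 100% means the API is cheaper at any utilization under these assumptions.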

Quantization memory savings

See how INT4/INT8/FP8 reduce model weight memory. Ch 26.

Parameters (B)
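Weight memory is parameters times bits per weight, divided by 8. The sketch below uses hypothetical names and a 70B example, and it ignores the small overhead of quantization scales and zero points.

```python
# Bits per weight for each format.
FORMAT_BITS = {"fp16/bf16": 16, "fp8": 8, "int8": 8, "int4": 4}


def weight_memory_gb(params_b, bits):
    """Weight memory in GB (10^9 bytes) for params_b billion parameters."""
    return params_b * bits / 8


params_b = 70  # hypothetical 70B-parameter model
baseline = weight_memory_gb(params_b, FORMAT_BITS["fp16/bf16"])
for name, bits in FORMAT_BITS.items():
    gb = weight_memory_gb(params_b, bits)
    print(f"{name}: {gb:.0f} GB ({1 - gb / baseline:.0%} saved)")
```

For 70B parameters this yields 140 GB at 16-bit, 70 GB at 8-bit, and 35 GB at 4-bit.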

Little's Law calculator

L = lambda x W. Concurrent requests = arrival rate x avg latency. Ch 75.

Inputs: Arrival rate (req/s), Avg latency (s)

Concurrent (L)

800
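Little's Law in Python, as a minimal sketch. The inputs behind the readout above are not given; 100 req/s at 8 s average latency is one hypothetical pair that produces 800. The inverse helper is an added convenience, not part of the calculator.

```python
def concurrent_requests(arrival_rate_per_s, avg_latency_s):
    """Little's Law: L = lambda x W."""
    return arrival_rate_per_s * avg_latency_s


def max_arrival_rate(concurrency_limit, avg_latency_s):
    """Rearranged: the arrival rate a fixed concurrency limit can sustain."""
    return concurrency_limit / avg_latency_s


print(concurrent_requests(100, 8.0))  # 800.0
print(max_arrival_rate(800, 4.0))     # 200.0
```

Halving average latency halves the concurrency needed for the same traffic, or doubles the traffic a fixed concurrency limit can carry.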

Compound availability

Availabilities of serial dependencies multiply: A = A1 x A2 x ... x An. Ch 95.

Inputs: Service 1 (%), Service 2 (%), Service 3 (%), Service 4 (%)

Compound availability

99.6%
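A short Python sketch of the product. The four 99.9% inputs are a hypothetical example that rounds to the 99.6% readout above; the downtime helper assumes a 30-day month.

```python
import math


def compound_availability(percentages):
    """Availability (%) of a request path that needs every listed service."""
    return math.prod(p / 100 for p in percentages) * 100


def monthly_downtime_minutes(availability_pct, days=30):
    """Expected downtime per month implied by an availability percentage."""
    return (1 - availability_pct / 100) * days * 24 * 60


total = compound_availability([99.9, 99.9, 99.9, 99.9])
print(f"{total:.1f}%")  # 99.6%
print(f"{monthly_downtime_minutes(total):.0f} min/month of downtime")
```

Four services at "three nines" each leave the chain at roughly 99.6%, about four times the downtime of any single service.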

Error budget burn rate

How fast you're consuming your monthly error budget. Ch 95.

Inputs: SLO target (%), Current success (%), Days into month
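One way to compute the burn rate from these three inputs, sketched in Python. The function name is hypothetical, and the sketch assumes the current success rate is measured month-to-date, traffic is roughly uniform, the month has 30 days, and the SLO is below 100%.

```python
def error_budget_burn(slo_pct, success_pct, days_elapsed, days_in_month=30):
    """Return (burn_rate, budget_consumed_pct, exhaustion_day)."""
    budget = 100 - slo_pct          # allowed failure percentage (assumes SLO < 100)
    observed = 100 - success_pct    # month-to-date failure percentage
    burn_rate = observed / budget   # 1.0 means spending exactly on budget
    consumed_pct = burn_rate * days_elapsed / days_in_month * 100
    # Day of the month on which the budget runs out at the current rate.
    exhaustion_day = days_in_month / burn_rate if burn_rate > 0 else float("inf")
    return burn_rate, consumed_pct, exhaustion_day


# Hypothetical: 99.9% SLO, 99.8% success so far, 10 days into the month.
rate, consumed, day = error_budget_burn(99.9, 99.8, 10)
print(f"burn rate {rate:.2f}x, {consumed:.1f}% of budget used, exhausted on day {day:.1f}")
```

In this example the service fails twice as often as the SLO allows, so about two thirds of the budget is gone by day 10 and it runs out around day 15.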