Part II · Training, Fine-Tuning, Alignment
Chapter 15 · Core · ~22 min read

Fine-tuning: full, LoRA, QLoRA, adapters, the PEFT family

"Every team that asks 'should we fine-tune?' should first ask 'have we tried prompting and RAG?' Most of them haven't."

Pretraining (Chapters 11–14) gives you a base model with general capabilities. Fine-tuning adapts that base to a specific task or domain. Where pretraining costs hundreds of millions of dollars and weeks of wall-clock time, fine-tuning costs anywhere from a few hundred dollars (a small LoRA on a single GPU) to tens of thousands (a full fine-tune of a 70B model on a small cluster). For 99% of teams that want an LLM of “their own,” fine-tuning is the entire game.

This chapter covers the methods of fine-tuning: full fine-tuning, the LoRA family, QLoRA, and the broader PEFT (Parameter-Efficient Fine-Tuning) zoo. The next chapter (16) covers what you fine-tune for: SFT and instruction tuning. The chapter after that (17) covers the alignment techniques (RLHF, DPO) that come on top.

By the end you will know:

  • When to fine-tune vs prompt vs RAG.
  • The math of LoRA and why low-rank decomposition of weight deltas works.
  • The full PEFT zoo: LoRA, QLoRA, adapters, prefix tuning, prompt tuning, IA³.
  • How to pick a method for a budget.
  • The merging-vs-serving-with-adapters question.

Outline:

  1. The “should you fine-tune?” filter.
  2. Full fine-tuning — what it costs.
  3. LoRA — the math, the rank choice, the targeted modules.
  4. Merging vs serving with adapters.
  5. QLoRA — 4-bit base + LoRA, and why it works.
  6. Adapters and the PEFT zoo: prefix tuning, prompt tuning, IA³.
  7. The HuggingFace peft library in 20 lines.
  8. The serving question: one model with N adapters.
  9. Common fine-tuning bugs.

15.1 The “should you fine-tune?” filter

Before any of the methods in this chapter, ask the questions in this order:

(1) Can you do it with prompting? A well-engineered system prompt with a few-shot example or two often gets you 80% of the way to a fine-tuned model, with zero training cost and instant iteration. The senior move is to exhaust prompt engineering before reaching for fine-tuning.

(2) Can you do it with RAG? If your problem is “the model doesn’t know X,” the answer is almost always retrieval, not fine-tuning. Fine-tuning teaches a model patterns, not facts. New facts are best handled by retrieving them at inference time and putting them in the context window. We’ll cover this in detail in Chapter 65.

(3) Can you do it with a better prompt + a better base model? Sometimes the right answer is “use a stronger base.” If you’re using a 7B model and you can afford to switch to a 70B, the 70B with the same prompt often beats the 7B fine-tune.

(4) Now consider fine-tuning. Fine-tuning is the right tool when:

  • You have a specific task (classification, translation, summarization in a narrow domain) where the model needs to consistently produce a specific format or follow a specific style.
  • You have enough data — usually a few hundred to a few thousand high-quality examples for a small task; tens of thousands or more for a meaningful instruction tune.
  • You need reproducibility — the fine-tuned model behaves consistently on the task without prompt fragility.
  • You need cost at scale — a fine-tuned 7B model can replace a prompted 70B model for a fraction of the inference cost.

These are real reasons. Most teams that ask about fine-tuning don’t have any of them; they want fine-tuning because it sounds more “real” than prompting. The senior move is to push back, hard, before committing to a fine-tune.

15.2 Full fine-tuning — what it costs

Full fine-tuning updates every parameter in the model. This means you need optimizer state and gradients for the entire model — same memory requirements as pretraining (Chapter 12). For a 70B model:

  • ~140 GB weights in bf16
  • ~140 GB gradients in bf16
  • ~280 GB AdamW first moment in fp32
  • ~280 GB AdamW second moment in fp32
  • ~280 GB master weights in fp32
  • = ~1.1 TB of GPU memory before activations

Plus activations, which scale with batch size and sequence length. To full-fine-tune a 70B model, you need a multi-GPU cluster with sharded data parallelism (FSDP) and probably some tensor parallelism. This is non-trivial. It’s probably $5k–$50k of compute for a single full-fine-tune run on a 70B model, depending on the dataset size.

For a 7B model, full fine-tuning is much more tractable. The total memory is ~110 GB, which fits on two H100s with FSDP. A 7B fine-tune on 100k examples is a few hundred dollars of compute.
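The memory arithmetic above packages neatly into a quick estimator (a hypothetical helper, not from any library): weights and gradients in bf16 at 2 bytes per parameter, AdamW moments and master weights in fp32 at 4 bytes each.

```python
# Hypothetical helper: estimate full fine-tuning memory from parameter
# count, following the bf16 weights/grads + fp32 AdamW recipe above.
def full_finetune_memory_gb(n_params: float) -> dict:
    GB = 1e9
    weights = n_params * 2 / GB    # bf16 weights
    grads   = n_params * 2 / GB    # bf16 gradients
    adam_m  = n_params * 4 / GB    # fp32 AdamW first moment
    adam_v  = n_params * 4 / GB    # fp32 AdamW second moment
    master  = n_params * 4 / GB    # fp32 master weights
    return {
        "weights": weights, "grads": grads,
        "optimizer": adam_m + adam_v, "master": master,
        "total": weights + grads + adam_m + adam_v + master,
    }

print(full_finetune_memory_gb(70e9)["total"])  # 1120.0 GB — the ~1.1 TB above
print(full_finetune_memory_gb(7e9)["total"])   # 112.0 GB — the ~110 GB above
```

Sixteen bytes per parameter before activations, regardless of model size — which is why the only lever full fine-tuning offers is choosing a smaller model.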

Full fine-tuning is the right choice when:

  • The model is small enough that the cost is acceptable.
  • You have lots of data and want the model to use all of its capacity to learn from it.
  • The output should be different enough from the base that adapter-style methods can’t capture it.

Full fine-tuning is not the right choice when:

  • The model is big and you don’t want to spend $10k+ per experiment.
  • You want to fine-tune on multiple tasks and serve them all without holding multiple full model copies.
  • You’re iterating quickly and need cheap experiments.

For these cases, you want parameter-efficient fine-tuning (PEFT).

15.3 LoRA — the math

LoRA (Low-Rank Adaptation, Hu et al., 2021) is the most influential PEFT method, by a large margin. The idea is dead simple, and it works.

The observation: when you fine-tune a pretrained model, the change in each weight matrix from the pretrained value is much lower-rank than the matrix itself. In math: if W₀ is the pretrained weight and W_finetuned = W₀ + ΔW, then ΔW has effective rank much smaller than min(d_in, d_out). Empirically, fine-tuning deltas have ranks in the dozens, not the thousands.

LoRA exploits this by representing the weight delta as a low-rank product:

ΔW = B A

where:

  • B is a matrix of shape (d_out, r)
  • A is a matrix of shape (r, d_in)
  • r is the LoRA rank, a small integer (commonly 4, 8, 16, 32, 64)

The full weight is W = W₀ + ΔW = W₀ + B A. During fine-tuning:

  • W₀ is frozen — no gradient, no update.
  • Only A and B are trained.

The number of trainable parameters per matrix:

trainable_params = r × (d_in + d_out)

For a single linear layer with d_in = d_out = 4096 and r = 16, that’s 16 × 8192 = 131,072 parameters — instead of 4096 × 4096 = 16,777,216 for full fine-tuning. A 128× reduction in trainable parameters per layer.

LoRA decomposes the weight delta into two small matrices: only B (d_out × r) and A (r × d_in) are trained, while W₀ (d_out × d_in) stays frozen; inference uses the sum W₀ + BA. Because r ≪ d (e.g. r = 16), representing ΔW as B·A gives the 128× reduction in trainable parameters — and gradient buffers and optimizer state shrink by the same factor.
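The per-matrix arithmetic is simple enough to sanity-check in code (a hypothetical helper, shown on the worked example above):

```python
# r * (d_in + d_out) trainable parameters per adapted weight matrix,
# versus d_in * d_out for full fine-tuning of that matrix.
def lora_params(d_in: int, d_out: int, r: int) -> int:
    return r * (d_in + d_out)

full = 4096 * 4096                     # 16,777,216
lora = lora_params(4096, 4096, r=16)   # 131,072
print(lora, full // lora)              # 131072 128
```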

Across the whole model, the savings are similar. A 7B base model with LoRA on the attention projections and r = 16 typically has 30M–100M trainable parameters — 1–2% of the full count. Training memory drops dramatically because:

  • Gradients are only needed for the trained parameters (~30M × 2 bytes = 60 MB instead of 14 GB).
  • AdamW state is only needed for the trained parameters (~30M × 8 bytes = 240 MB instead of 56 GB).
  • Master fp32 weights are only needed for the trained parameters.

The base weights are still loaded in full, but they’re frozen — no gradient buffers, no optimizer state, just the static weights. A 7B LoRA fine-tune fits in 24 GB of VRAM (one consumer GPU), and a 70B LoRA fine-tune fits in 80 GB (one H100). The cost reduction vs full fine-tuning is two orders of magnitude.

Why low-rank works

The intuition for why fine-tuning deltas are low-rank: pretraining has already pushed the model into a region of weight space where the model knows almost everything about general language. The fine-tune is making a small targeted adjustment — “use this format” or “respond more concisely” or “speak in this domain’s vocabulary.” Small adjustments to a high-dimensional learned function can be expressed as low-rank perturbations.

This is not theory; it’s empirical. The Hu et al. paper measured the effective rank of full fine-tuning deltas across many tasks and consistently found ranks in the dozens. LoRA exploits this empirical fact.
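The measurement itself is just an SVD. A sketch on synthetic data (a fabricated rank-16 delta plus noise, standing in for a real fine-tuning delta) shows how you would check the effective rank of any ΔW:

```python
import numpy as np

# Synthetic stand-in for a fine-tuning delta: genuinely low-rank plus
# a little noise. On a real model you would use delta = W_ft - W_base.
rng = np.random.default_rng(0)
d = 512
B = rng.standard_normal((d, 16))
A = rng.standard_normal((16, d))
delta = B @ A + 1e-3 * rng.standard_normal((d, d))

s = np.linalg.svd(delta, compute_uv=False)
# Effective rank: number of singular values needed for 99% of the energy.
energy = np.cumsum(s**2) / np.sum(s**2)
eff_rank = int(np.searchsorted(energy, 0.99)) + 1
print(eff_rank)  # <= 16 — far below min(d_in, d_out) = 512
```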

The rank choice

Common LoRA ranks and their use cases:

Rank | Trainable params (7B base, attention only) | When to use
4    | ~10M  | Tiny stylistic adjustments, very small datasets.
8    | ~20M  | Small task-specific fine-tunes, modest datasets.
16   | ~40M  | The standard. Default for most LoRA fine-tunes.
32   | ~80M  | Larger fine-tunes, complex tasks, more data.
64   | ~160M | Approaches full fine-tuning quality on most tasks.
128  | ~320M | Diminishing returns vs full fine-tuning at this point.

The empirical pattern: doubling rank gives diminishing returns. Going from r=4 to r=8 is a clear quality jump; from r=64 to r=128 is small. Most production fine-tunes land at r=16 or r=32.

There’s also a hyperparameter called LoRA alpha (α). The actual update is ΔW = (α/r) × B A. The α/r factor is a scaling that decouples the learning rate from the rank — without it, doubling r would effectively double the magnitude of updates and require relearning the LR. The convention is to set α = r or α = 2r. With α = r, the scaling is 1, which is the simplest case.
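A minimal numpy sketch of the scaled forward pass, using the standard LoRA initialization (A small random, B zero, so the delta starts at exactly zero and the model begins training as the unmodified base):

```python
import numpy as np

# y = W0 x + (alpha / r) * B (A x). W0 is frozen; only A and B train.
rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 64, 8, 16

W0 = rng.standard_normal((d_out, d_in))    # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trained, small random init
B = np.zeros((d_out, r))                   # trained, zero init: delta starts at 0

def lora_forward(x):
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B = 0 the LoRA path contributes nothing, so the first forward
# pass reproduces the frozen base exactly.
assert np.allclose(lora_forward(x), W0 @ x)
```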

Which modules to LoRA

You don’t have to LoRA every linear layer. Common targeting choices:

  • Attention only (Q, K, V, output projections). Most common default. Cheap and effective.
  • Attention + FFN (attention projections + the FFN’s gate, up, down). More parameters, generally better quality.
  • Everything (all linear layers including the LM head). Highest cost, marginal additional quality.

The “attention + FFN” target is the modern default for LoRA fine-tunes that want quality without going to full fine-tuning. The HuggingFace peft library calls this target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"].

The original LoRA paper recommended attention-only because it gave the best parameter efficiency. Subsequent work showed that LoRA on the FFN matters too — the FFN is where most of the model’s “knowledge” lives, and adding LoRA there gives more of a fine-tuning quality lift than adding it to attention.

15.4 Merging vs serving with adapters

Once you’ve trained a LoRA adapter, you have two options for serving it.

Option 1: Merge. Compute W_merged = W₀ + B A for each matrix, replace the base weights with the merged weights, and discard the adapter. The result is a regular fine-tuned model with no adapter overhead. Inference is identical to the base model in cost. The downside is that you’ve baked the adapter in — you can’t switch adapters at runtime, and you’ve lost the ability to load multiple adapters on the same base.

Option 2: Keep the adapter and apply it at inference. Load the base weights once, load the (small) adapter, and during forward compute W₀ x + B (A x). This adds two extra matmuls per layer compared to the base. The overhead is small (the adapter matmuls are tiny), and the killer feature is that you can swap adapters at runtime: one base model, many adapters, each adapter served as a separate “tenant.”
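The two options are numerically equivalent, which a few lines of numpy can confirm — merging just precomputes the sum so inference needs one matmul instead of three:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 128, 16
W0 = rng.standard_normal((d, d))
B = rng.standard_normal((d, r))
A = rng.standard_normal((r, d))
x = rng.standard_normal(d)

y_adapter = W0 @ x + B @ (A @ x)  # option 2: base matmul + two tiny matmuls
W_merged = W0 + B @ A             # option 1: merge once, offline
y_merged = W_merged @ x           # then inference is a single matmul

assert np.allclose(y_adapter, y_merged)
```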

Multi-tenant LoRA serving: the base model (140 GB for a bf16 70B) is loaded onto the GPU once, and a request header such as X-LoRA-Adapter routes each request to one of N small adapters (~80 MB each) — roughly 1,750× less per-tenant memory than running a full copy of the model per customer.

The second option is the foundation of multi-tenant LoRA serving, which has become a standard pattern. vLLM, SGLang, and most production serving stacks support loading multiple LoRA adapters on a single base model and routing requests to the right adapter based on a header or model name. This lets a single fleet of inference replicas serve dozens of fine-tuned variants of the same base, with negligible per-adapter overhead.

The S-LoRA system (Sheng et al., 2023) extended this idea with batched LoRA serving — multiple requests for different adapters can run in the same batch with carefully scheduled GPU memory usage. This is the production technique for large multi-tenant LoRA deployments.

The merge-vs-serve decision:

  • Merge if you have one fine-tune per base, you don’t need runtime swapping, and you want the simplest serving stack.
  • Serve with adapter if you have many fine-tunes per base, you want runtime swapping, or you need multi-tenant isolation.

Most production teams end up doing both: merge for the canonical “production model,” but keep the adapters for canarying new fine-tunes and for multi-tenant scenarios.

15.5 QLoRA — 4-bit base + LoRA

QLoRA (Dettmers et al., 2023) is the most important practical fine-tuning advance of the post-2022 era. The trick: quantize the base model to 4 bits before applying LoRA.

The 4-bit base is loaded once into VRAM, takes 1/4 the memory of a bf16 base, and remains frozen. The LoRA adapters are still trained in bf16 (small enough that the precision doesn’t matter for memory). During the forward pass, the base weights are dequantized on the fly into bf16 just before the matmul, the LoRA delta is added, and the matmul runs as usual.

The memory math for a 70B base model:

Item | Standard LoRA (bf16 base) | QLoRA (4-bit base)
Base weights | 140 GB | 35 GB
LoRA adapters (rank 16) | ~0.5 GB | ~0.5 GB
LoRA gradients | ~0.5 GB | ~0.5 GB
LoRA optimizer state | ~2 GB | ~2 GB
Activations + misc | ~10 GB | ~10 GB
Total | ~153 GB | ~48 GB

That is a 70B model fine-tuned on a single H100 (80 GB). On a single A100 (40 GB) it’s tight but possible with careful gradient checkpointing. A few years ago this was unimaginable. Now it’s the standard way to fine-tune medium-large models on consumer or single-GPU hardware.

QLoRA’s 4-bit base cuts the 70B fine-tuning footprint from ~153 GB (standard LoRA on a bf16 base) to ~48 GB — moving the hardware requirement from a multi-GPU cluster to a single H100.

QLoRA’s specific technical contributions:
  • NF4 (4-bit NormalFloat) — a custom 4-bit quantization scheme designed for the empirical distribution of pretrained weights (which is approximately normal). NF4 has better quantization quality than uniform 4-bit at the same bit width.
  • Double quantization — quantize the quantization constants themselves, saving additional memory.
  • Paged optimizers — use NVIDIA’s unified memory to page optimizer state to CPU when GPU memory is tight, smoothing over memory spikes during training.
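A toy blockwise 4-bit quantizer makes the mechanics concrete. It uses uniform levels for simplicity — NF4’s contribution is replacing these with 16 levels placed at quantiles of a normal distribution — and is not the bitsandbytes implementation:

```python
import numpy as np

# Toy blockwise 4-bit quantize/dequantize. Real NF4 uses a normal-quantile
# codebook instead of these uniform levels.
rng = np.random.default_rng(0)
w = rng.standard_normal(64).astype(np.float32)  # one 64-weight block

levels = np.linspace(-1.0, 1.0, 16)             # 16 representable values = 4 bits
absmax = np.abs(w).max()                        # per-block quantization constant
codes = np.abs(w[:, None] / absmax - levels).argmin(axis=1)  # stored 4-bit codes
w_hat = levels[codes] * absmax                  # dequantized just before the matmul

max_err = np.abs(w - w_hat).max()               # at most half a level step
```

Double quantization then compresses the absmax constants themselves, since storing one fp32 scale per 64-weight block is itself a measurable overhead.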

The quality gap between QLoRA and standard LoRA is empirically tiny on most tasks. There is some loss for very small models (where every bit of base precision matters), but at 7B and above, the gap is essentially nothing.

In production, QLoRA is the default fine-tuning method for models that don’t fit in standard LoRA’s memory budget. For 7B and 13B models on consumer GPUs, QLoRA. For 70B models on a single H100, QLoRA. For everything else — full fine-tuning if you can afford it, standard LoRA if you want adapter portability without the quantization step.

15.6 The PEFT zoo

LoRA is the dominant method but it’s not the only one. The PEFT family includes:

Prefix tuning (Li & Liang, 2021)

Instead of modifying the model weights, prepend a learned sequence of “virtual tokens” to the input. The virtual tokens have their own embeddings (and sometimes their own per-layer key/value pairs added to the attention). The base model is frozen; only the virtual tokens’ parameters are trained.

Pros: very few parameters (~10k–100k for the whole model), easy to swap. Cons: capacity is limited; doesn’t match LoRA on most tasks. Has been mostly displaced.

Prompt tuning (Lester et al., 2021)

A simpler variant of prefix tuning: only the input embedding layer’s “prefix” is learned, not per-layer KV pairs. Even fewer parameters, even less capacity. Mostly historical now.

Adapters (Houlsby et al., 2019)

Insert small bottleneck modules between transformer layers. The base layers are frozen; only the adapter modules (which have a down-projection, a nonlinearity, and an up-projection) are trained. Each adapter module is small — a few hundred thousand parameters per layer.

Adapters predate LoRA and were the first popular PEFT method. They work, but they add inference latency (you have to compute the adapter outputs sequentially, after each layer). LoRA is generally preferred because it can be merged into the base weights at inference time, eliminating the latency overhead.

IA³ (Liu et al., 2022)

Even more parameter-efficient: learn a per-channel scaling vector for the keys, values, and FFN intermediate activations. That’s it — just three vectors per layer, multiplied element-wise into the activations. The trainable parameter count is tiny (under 1% of LoRA at the same target).

IA³ is competitive with LoRA on small fine-tunes and dramatically cheaper. It’s a good choice when you want the absolute minimum parameter count, but it’s been less widely adopted than LoRA because LoRA’s merging-and-serving story is so clean.
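A toy sketch of the IA³ idea — per-channel scaling vectors, initialized to 1 so training starts from the unmodified base:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ff = 64, 256
W_k = rng.standard_normal((d, d))      # frozen key projection
W_ff = rng.standard_normal((d_ff, d))  # frozen FFN up projection

l_k = np.ones(d)     # trained: element-wise scaling for keys
l_ff = np.ones(d_ff) # trained: element-wise scaling for FFN activations

x = rng.standard_normal(d)
k = l_k * (W_k @ x)                  # scaled keys
h = l_ff * np.maximum(W_ff @ x, 0)   # scaled FFN intermediate (ReLU for the toy)

# Trainable parameters: just the vectors — tiny compared to LoRA.
print(l_k.size + l_ff.size)  # 320 for this toy layer
```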

DoRA (Liu et al., 2024)

Weight-Decomposed Low-Rank Adaptation. Decomposes each weight matrix into a magnitude vector and a direction matrix, then applies LoRA only to the direction. Empirically, gives a small but consistent quality boost over LoRA at the same parameter count. Gaining adoption in 2024–25.
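A numpy sketch of the DoRA decomposition (per-column magnitude, LoRA on the direction — simplified from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8
W0 = rng.standard_normal((d, d))

m = np.linalg.norm(W0, axis=0)  # trained magnitude vector (one per column)
B = np.zeros((d, r))            # LoRA on the direction, standard zero init
A = rng.standard_normal((r, d)) * 0.01

V = W0 + B @ A                              # direction before normalization
W = m * (V / np.linalg.norm(V, axis=0))     # recombine: magnitude x unit direction

# With B = 0, the decomposition reproduces the pretrained weight exactly,
# so training starts from the base model, as in plain LoRA.
assert np.allclose(W, W0)
```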

The pattern across the zoo: LoRA and its variants are the dominant family. Adapters were first; prefix and prompt tuning are mostly historical; IA³ is a niche tool for ultra-low parameter budgets; DoRA is the marginal improvement people are exploring now. For 95% of fine-tunes, plain LoRA (or QLoRA) is the right choice.

15.7 The peft library in 20 lines

The HuggingFace peft library makes LoRA training trivial:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer
from peft import LoraConfig, get_peft_model, TaskType

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.5196

# now train as normal
trainer = Trainer(model=model, args=..., train_dataset=...)
trainer.train()

# save the adapter (small file, ~80MB for r=16 on 8B)
model.save_pretrained("./my-lora")

For QLoRA, add the BitsAndBytesConfig:

from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",
    quantization_config=bnb_config,
    device_map="auto",
)
# rest is the same — get_peft_model, train

That’s the entire production fine-tuning code. The bulk of the work in a real fine-tune is in the dataset preparation (Chapter 16) and the eval (Chapter 20), not the training loop.

15.8 Multi-tenant LoRA serving — the production pattern

The killer feature of LoRA in production is multi-tenancy. One base model, many adapters, all served from the same fleet.

The pattern:

  • Load the base model into GPU memory once.
  • Load each adapter into a small, separate memory region (~80 MB for r=16 on a 7B model).
  • Route each inference request to the appropriate adapter based on a header (X-Lora-Adapter: customer-acme-v3) or path.
  • During the forward pass, look up the adapter weights for the request’s adapter ID and apply them in addition to the base weights.

This lets a single 8-GPU node serve dozens or hundreds of customer-specific fine-tunes with the same base model. The marginal cost of adding a new adapter is the storage for ~80 MB of weights plus a small lookup overhead per request. Compare this to the alternative of running a separate model for each customer (each model would require its own GPU fleet) — the savings are 100×+.
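The pattern reduces to a registry plus a per-request lookup. A deliberately simplified sketch — the class, header name, and request shape here are illustrative, not any particular server’s API:

```python
# Illustrative multi-tenant LoRA routing. Real servers (vLLM, SGLang)
# additionally batch requests across adapters and manage GPU memory.
class LoraRouter:
    def __init__(self, base_model):
        self.base = base_model  # loaded once, shared by every tenant
        self.adapters = {}      # adapter_id -> small adapter weights (~80 MB)

    def register(self, adapter_id, adapter_weights):
        self.adapters[adapter_id] = adapter_weights

    def handle(self, request):
        adapter_id = request["headers"].get("X-LoRA-Adapter")
        adapter = self.adapters[adapter_id]  # per-request lookup
        # The forward pass would compute W0 x + B (A x) with this adapter.
        return {"adapter": adapter_id, "base": self.base}

router = LoraRouter(base_model="llama-3-8b")
router.register("customer-acme-v3", adapter_weights={"r": 16})
out = router.handle({"headers": {"X-LoRA-Adapter": "customer-acme-v3"}})
assert out["adapter"] == "customer-acme-v3"
```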

vLLM has had first-class LoRA support since version 0.3 (vllm serve --enable-lora). SGLang supports it. Most production serving stacks support it. The S-LoRA paper showed the pattern scales to thousands of adapters per base.

The catch: not every fine-tuning workflow ends up as a LoRA. If the customer wants a full fine-tune (because they paid for it, or because the adapter doesn’t capture what they need), it has to be served as its own model. The decision ends up being economic: how much customer-specific quality do they pay for? Most pay for the LoRA; a few pay for the full fine-tune.

15.9 Common fine-tuning bugs

The fine-tuning bugs that come up over and over:

(1) Wrong chat template. You fine-tune on a dataset formatted with one template, then serve with another. The model behaves bizarrely. The fix: use tokenizer.apply_chat_template consistently (Chapter 5) and verify the dataset formatting matches what the tokenizer produces.

(2) Catastrophic forgetting. You fine-tune on a narrow task, and the model loses its general capabilities. Usually a sign of training too long, learning rate too high, or rank too high. Reduce one of the three.

(3) Loss masking error. You compute the loss on the entire input including the prompt. The model gets credit for “predicting” the tokens you fed it, which is trivial, and the gradient signal from the actual response is diluted. Fix: mask the prompt tokens out of the loss computation. This is the single most common bug in custom training scripts. (Frameworks like trl’s SFTTrainer handle this for you if you use them correctly.)

(4) LoRA target_modules don’t include the right layers. The default ["q_proj", "v_proj"] from the original paper is small. The model trains, but it underperforms because most of its capacity isn’t being touched. Use the full list (q,k,v,o,gate,up,down) for serious fine-tunes.

(5) Mixed-precision corruption. You’re training in fp16 without GradScaler, or you’ve turned off autocast somewhere. The gradients underflow and the model doesn’t learn. Switch to bf16 (Chapter 13).

(6) Quality regression on benchmarks but improvement on the task. Often expected — the model has specialized at the cost of generalization. The question is whether the regression is acceptable for your use case. Always run a multi-task eval, not just the single task you trained on (Chapter 20).

(7) Adapters trained at different ranks interfering. If you load multiple LoRAs at different ranks into the same base, they don’t compose. Either standardize on one rank or use independent serving for each.

(8) Forgetting to set model.train() / model.eval(). Standard PyTorch bug. In training, dropout and batch norm should be in train mode; in inference, they should be in eval mode. Forgetting this is one of the top bugs in custom loops.
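The loss-masking fix from bug (3) comes down to setting the label to -100 — the ignore index used by PyTorch’s cross-entropy and the HuggingFace trainers — on every prompt token, so only the response contributes gradient:

```python
# Toy token ids for illustration; real ids come from the tokenizer.
prompt_ids = [101, 2054, 2003, 1996]
response_ids = [3437, 2003, 2182, 102]

input_ids = prompt_ids + response_ids
# Mask the prompt: -100 labels are skipped by the loss, so the model
# only gets credit (and gradient) for predicting the response.
labels = [-100] * len(prompt_ids) + response_ids

assert len(labels) == len(input_ids)
assert all(l == -100 for l in labels[:len(prompt_ids)])
```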

15.10 The mental model

Eight points to take into Chapter 16:

  1. Try prompting and RAG before fine-tuning. Most teams skip this and they shouldn’t.
  2. Full fine-tuning updates every parameter and costs as much as a small pretraining run. Use only when justified.
  3. LoRA decomposes the weight delta as a low-rank product B A. Trains 1% of parameters at ~1% of the memory.
  4. Rank 16 with full target modules is the modern LoRA default.
  5. QLoRA loads the base in 4-bit NF4 and trains LoRA on top. Default for 70B+ fine-tuning on a single GPU.
  6. Adapters can be merged or served alongside the base. Multi-tenant LoRA serving is the production pattern.
  7. The PEFT zoo has more methods (adapters, IA³, DoRA, prompt/prefix tuning), but LoRA dominates.
  8. Loss masking on the prompt is the #1 fine-tuning bug. Use a framework like trl that handles it.

In Chapter 16 we cover what you train your fine-tunes for: SFT, instruction tuning, chat templates, dataset construction.


Read it yourself

  • Hu et al., LoRA: Low-Rank Adaptation of Large Language Models (2021). The LoRA paper. Read sections 4 and 5.
  • Dettmers et al., QLoRA: Efficient Finetuning of Quantized LLMs (2023). Read sections 3 and 4.
  • Sheng et al., S-LoRA: Serving Thousands of Concurrent LoRA Adapters (2023). The multi-tenant serving paper.
  • The HuggingFace peft library documentation and examples.
  • The HuggingFace trl library documentation, especially SFTTrainer.
  • Liu et al., DoRA: Weight-Decomposed Low-Rank Adaptation (2024) — the next-generation LoRA variant.

Practice

  1. Compute the LoRA parameter count for r=16 on every linear layer of a 7B model. Compare to the full parameter count. What percentage of the model is trainable?
  2. Why does LoRA work — i.e., why are fine-tuning deltas low-rank? Construct an empirical experiment to verify this on a small model.
  3. A 70B model in QLoRA needs ~48 GB of GPU memory. Compare to standard LoRA (~153 GB) and full fine-tuning (~1.1 TB). Where do the savings come from at each step?
  4. Train a tiny LoRA on a 7B base model for a simple task (e.g., translating English to Pirate). Use the HuggingFace peft library. Run for 100 steps, save the adapter, and load it back. The whole exercise should take an hour.
  5. Why does the merge step (W = W₀ + B A) eliminate inference overhead? What’s the runtime cost of not merging?
  6. You want to serve 50 customer-specific fine-tunes of Llama 3 8B on a single 8×H100 node. Sketch the serving architecture using multi-tenant LoRA. How much memory do you need? What happens at request time?
  7. Stretch: Train a QLoRA fine-tune of a 13B model on a small instruction dataset (Alpaca, Dolly). Compare the trained model’s outputs to the base on the same prompts. Where do you see the fine-tuning effect, and where doesn’t it?

Concept check

  1. In LoRA, what does the rank r control?
  2. What key memory saving does QLoRA achieve over standard LoRA?
  3. Why is merging a LoRA adapter back into the base weights preferred over keeping them separate at serving time?
  4. A team has a 7B base model and 2000 high-quality (instruction, response) pairs for a narrow classification task. Full fine-tuning produces marginal improvement over the base model. What is the most likely root cause?