Skip to content

Fine-Tuning Overview

A pre-trained LLM knows how to generate text. Fine-tuning teaches it what kind of text to generate for your specific task. The difference between a model that sometimes does what you want and one that reliably does what you need is often a few hundred well-chosen training examples.

Learning objectives

  • Distinguish full fine-tuning, PEFT, and inference-time techniques (RAG, prompting)
  • Understand what gradient updates actually change in a transformer
  • Know the cost difference between full fine-tuning and LoRA
  • Identify which technique addresses which failure mode

What fine-tuning changes

Pre-training teaches a model the statistical structure of language. Fine-tuning specializes it:

Pre-training:   "Complete this text in any plausible way"
Instruction FT: "Follow instructions in the format shown"
Task FT:        "Extract entities from customer support tickets"
Style FT:       "Write in the formal tone of legal documents"

Mechanically, fine-tuning runs gradient descent on labeled examples, updating model weights to minimize loss on your specific task. The backbone stays the same; the weights shift toward your distribution.


The four options

                    Data needed    Cost      Latency    Customization
─────────────────────────────────────────────────────────────────────
Prompting           0 examples    $0        0ms extra  Low
RAG                 0 examples    Storage   +50ms      Medium (domain)
PEFT (LoRA/QLoRA)   100–10K       $10–200   0ms        High
Full fine-tuning    10K–100K      $500–10K  0ms        Highest

Prompting: Free, instant, but limited by the model's pre-existing behaviors. Works for general tasks the model already does well.

RAG: Injects up-to-date knowledge at query time without changing weights. Best when the problem is what the model knows, not how it behaves.

PEFT (LoRA/QLoRA): Updates a small fraction of parameters (1–3% of total). Cost is 10–100x lower than full fine-tuning. Suitable for most task adaptation.

Full fine-tuning: Updates all parameters. Requires large datasets and significant compute. Justified only when PEFT isn't sufficient or you're training from scratch.


The training loop (simplified)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One gradient step (conceptually)
def training_step(batch: dict) -> float:
    input_ids = batch["input_ids"].to(model.device)
    labels = batch["labels"].to(model.device)

    outputs = model(input_ids=input_ids, labels=labels)
    loss = outputs.loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    return loss.item()

For instruction fine-tuning, labels mask out the prompt tokens — you only compute loss on the completion tokens. The model learns to generate the right output, not to predict the prompt.


Instruction fine-tuning format

Most modern fine-tuning uses a chat template. The model learns to follow the structure:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful customer support agent."},
    {"role": "user", "content": "My order hasn't arrived."},
    {"role": "assistant", "content": "I'm sorry to hear that. Can you share your order number?"}
]

formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
print(formatted)
# <|begin_of_text|><|start_header_id|>system<|end_header_id|>
# You are a helpful customer support agent.<|eot_id|>...

The formatted string becomes one training example. Thousands of such examples, run through gradient descent, shift the model toward your target behavior.


Supervised fine-tuning vs RLHF

Stage What it optimizes Data type Used for
Supervised FT (SFT) Minimize loss on human-written outputs Input → output pairs Task adaptation
Reward Modeling (RM) Predict human preference score Ranked pairs RLHF stage 1
RLHF / PPO Maximize reward model score Model outputs + scores Alignment
DPO Maximize preferred over rejected Preferred/rejected pairs Alignment (simpler)

For task-specific fine-tuning (classification, extraction, style), SFT on labeled examples is almost always the right approach. RLHF and DPO are for alignment — teaching models to be helpful and harmless.

Most task fine-tuning uses SFT only

Unless you're training a general assistant, supervised fine-tuning on high-quality labeled examples is sufficient. RLHF adds significant complexity and requires a reliable reward signal — hard to get right.


Common misconceptions

Fine-tuning doesn't add knowledge

Fine-tuning changes behavior, not knowledge. If the model doesn't know a fact before fine-tuning, it won't know it after. Use RAG to inject knowledge; use fine-tuning to change output format, style, or task-following behavior.

More data isn't always better

100 high-quality, diverse examples often outperform 10,000 noisy examples. Fine-tuning amplifies the patterns in your data — bad data teaches the model bad patterns.


00-agenda | 02-lora-and-qlora