When to Fine-Tune vs RAG vs Prompting

Fine-tuning is a tool, not a default. It's expensive to run once and expensive to repeat when requirements change. Before committing to a training run, exhaust cheaper options. The question isn't "should we fine-tune?" — it's "what is actually failing and what's the cheapest fix?"

Learning objectives

  • Apply the decision framework: prompting → RAG → fine-tuning
  • Identify which failure modes each technique addresses
  • Calculate when fine-tuning is cost-justified vs API usage
  • Know the three situations where fine-tuning is clearly the right choice

The decision framework

def choose_technique(
    problem: str,
    tried_prompting: bool = False,
    knowledge_changes_frequently: bool = False,
    examples_available: int = 0,
    cost_per_query_too_high: bool = False,
) -> str:
    """
    Decision tree for choosing between prompting, RAG, and fine-tuning.
    Treat it as a checklist: the inputs are judgment calls, not measurements.
    """
    # Step 1: Have you tried prompting first?
    if not tried_prompting:
        return "Try few-shot prompting with 3–5 examples first"

    # Step 2: Is the problem knowledge (facts) or behavior (format/style)?
    if problem == "model_lacks_knowledge":
        if knowledge_changes_frequently:
            return "RAG: inject knowledge at query time"
        return "RAG or fine-tuning, depending on query volume"

    if problem == "model_behavior":
        # Behavior: format, style, task-following, domain-specific output
        if examples_available >= 100:
            return "Fine-tuning (LoRA)"
        return "Improve your prompt; collect more examples"

    # Step 3: Is latency or cost driving the decision?
    if cost_per_query_too_high:
        # A fine-tuned smaller model is often cheaper than API calls to a frontier model
        return "Fine-tuning a smaller model to replace large-model API calls"

    return "RAG or prompting"

What each technique actually fixes

TECHNIQUE_FIT = {
    "prompting": {
        "fixes": [
            "Model ignores your format (add explicit format instructions)",
            "Model output is too long/short (add length constraint)",
            "Model refuses valid requests (adjust system prompt)",
            "Model uses wrong tone (add tone examples in system prompt)",
        ],
        "does_not_fix": [
            "Model lacks domain knowledge",
            "Consistent behavior across hundreds of edge cases",
            "Task too complex for zero/few-shot",
        ],
        "cost": "$0 extra",
        "time_to_deploy": "Minutes",
    },
    "RAG": {
        "fixes": [
            "Model answers with outdated information",
            "Model doesn't know company-specific data",
            "Model hallucinates facts that exist in your documents",
            "Model needs to cite sources",
        ],
        "does_not_fix": [
            "Model output format/style inconsistency",
            "Model behavior on edge cases",
            "Latency-sensitive applications (RAG adds retrieval time)",
        ],
        "cost": "Vector DB + retrieval compute",
        "time_to_deploy": "Hours to days",
    },
    "fine_tuning": {
        "fixes": [
            "Consistent output format across all inputs",
            "Domain-specific tone and vocabulary",
            "Complex task-following that few-shot can't handle",
            "Reduce prompt length (bake instructions into weights)",
            "Replace expensive large model with fine-tuned small model",
        ],
        "does_not_fix": [
            "Lack of factual knowledge (use RAG for this)",
            "Need for real-time information",
            "Problems solvable with better prompting",
        ],
        "cost": "$10–500 for QLoRA; months of data collection",
        "time_to_deploy": "Days to weeks",
    },
}

for tech, info in TECHNIQUE_FIT.items():
    print(f"\n{'=' * 40}")
    print(f"Technique: {tech}")
    print(f"Fixes: {info['fixes'][0]}, ...")
    print(f"Cost: {info['cost']}")

Cost comparison: fine-tuned model vs API

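At high query volume, a fine-tuned small model often costs less than API calls to a frontier model, because self-hosted inference is a roughly fixed monthly cost while API spend grows with every token: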

from dataclasses import dataclass

@dataclass
class CostScenario:
    queries_per_month: int
    avg_input_tokens: int
    avg_output_tokens: int

def api_cost_per_month(scenario: CostScenario) -> float:
    """GPT-4o pricing: $2.50/1M input, $10/1M output (as of 2025)."""
    input_cost = (scenario.queries_per_month * scenario.avg_input_tokens / 1e6) * 2.50
    output_cost = (scenario.queries_per_month * scenario.avg_output_tokens / 1e6) * 10.0
    return input_cost + output_cost

def fine_tuned_cost_per_month(
    training_cost: float = 50.0,      # One-time QLoRA training on cloud GPU
    inference_cost_per_month: float = 200.0,  # Self-hosted or dedicated endpoint
    amortize_months: int = 12
) -> float:
    return (training_cost / amortize_months) + inference_cost_per_month

scenario = CostScenario(
    queries_per_month=100_000,
    avg_input_tokens=500,
    avg_output_tokens=100,
)

api_monthly = api_cost_per_month(scenario)
ft_monthly = fine_tuned_cost_per_month()

print(f"API (GPT-4o):          ${api_monthly:,.0f}/month")
print(f"Fine-tuned (self-host): ${ft_monthly:,.0f}/month")
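# Break-even: the query volume at which API spend equals the fixed fine-tuned cost
api_cost_per_query = api_monthly / scenario.queries_per_month
print(f"Break-even at: {ft_monthly / api_cost_per_query:,.0f} queries/month")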
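
To see how the comparison shifts with volume, the sketch below sweeps a few query volumes using the same illustrative prices as the example above. It assumes the self-hosted cost stays flat as volume grows, which stops holding once a single endpoint can no longer serve the load. The names `api_cost` and `break_even_queries` are hypothetical.

```python
# Illustrative prices taken from the example above; substitute your own.
INPUT_PRICE_PER_M = 2.50    # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 10.0   # USD per 1M output tokens
FT_MONTHLY = 50.0 / 12 + 200.0  # amortized training + assumed flat hosting cost


def api_cost(queries: int, in_tokens: int = 500, out_tokens: int = 100) -> float:
    """Monthly API cost in USD for a given query volume."""
    per_query = (in_tokens * INPUT_PRICE_PER_M + out_tokens * OUTPUT_PRICE_PER_M) / 1e6
    return queries * per_query


def break_even_queries(in_tokens: int = 500, out_tokens: int = 100) -> float:
    """Query volume at which API cost equals the fixed fine-tuned cost."""
    return FT_MONTHLY / api_cost(1, in_tokens, out_tokens)


for q in (10_000, 100_000, 1_000_000):
    cheaper = "fine-tuned" if api_cost(q) > FT_MONTHLY else "API"
    print(f"{q:>9,} queries: API ${api_cost(q):>8,.2f} vs fine-tuned ${FT_MONTHLY:,.2f} -> {cheaper}")
```

Under these assumptions the crossover is a little under 91,000 queries per month. Below that volume the API is the cheaper option.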

The three clear cases for fine-tuning

Case 1: CONSISTENT BEHAVIOR AT SCALE
  The task has a specific output format that prompting achieves 80% of the time
  but you need 98%+. Edge cases break prompt-only solutions.
  Example: structured JSON extraction from noisy text.

Case 2: LATENCY + COST REDUCTION
  You're using GPT-4o for a simple classification task that a fine-tuned
  Phi-3-mini could handle equally well. At sufficiently high volume, the smaller
  model can cut cost by as much as 50–100x and usually responds faster.
  Example: spam detection, ticket routing, entity tagging.

Case 3: PROPRIETARY STYLE OR DOMAIN
  Your company has a unique writing style or highly specialized domain language
  that no amount of prompting reliably replicates.
  Example: medical report generation with specific formatting requirements.
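
Case 1 depends on knowing your actual compliance rate. A minimal sketch of how it could be measured for JSON extraction: parse each model output and check that the required keys are present. The function name `format_compliance`, the `required_keys` parameter, and the sample outputs are hypothetical.

```python
import json


def format_compliance(outputs: list[str], required_keys: set[str]) -> float:
    """Fraction of outputs that parse as a JSON object containing all required keys."""
    if not outputs:
        return 0.0
    ok = 0
    for text in outputs:
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            continue  # not valid JSON at all
        if isinstance(parsed, dict) and required_keys.issubset(parsed):
            ok += 1
    return ok / len(outputs)


# Hypothetical model outputs for an extraction prompt
samples = [
    '{"name": "Ada", "amount": 12.5}',
    'Sure! Here is the JSON: {"name": "Bob", "amount": 3}',  # preamble breaks parsing
    '{"name": "Cy"}',                                         # missing key
    '{"name": "Di", "amount": 7}',
]
rate = format_compliance(samples, {"name", "amount"})
print(f"Compliance: {rate:.0%}")
```

Running the same check on a held-out set before and after fine-tuning gives a direct measure of whether the training run closed the gap.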

Combining techniques

Fine-tuning and RAG are complementary, not competing:

import torch

# Pattern: Fine-tuned model as the RAG reader
# - RAG retrieves relevant context (solves the knowledge problem)
# - Fine-tuned model generates consistent output format (solves the behavior problem)

def rag_with_fine_tuned_reader(
    question: str,
    retriever,
    fine_tuned_model,
    tokenizer
) -> str:
    # Step 1: RAG retrieval (retriever.retrieve is expected to return a list of strings)
    context_chunks = retriever.retrieve(question, k=5)
    context = "\n".join(context_chunks)

    # Step 2: Fine-tuned model reads and answers
    prompt = f"### Context:\n{context}\n\n### Question:\n{question}\n\n### Answer:"
    inputs = tokenizer(prompt, return_tensors="pt").to(fine_tuned_model.device)

    # Greedy decoding; gradients are not needed at inference time
    with torch.no_grad():
        outputs = fine_tuned_model.generate(
            **inputs, max_new_tokens=200, do_sample=False
        )

    # Decode only the newly generated tokens, skipping the prompt
    prompt_length = inputs["input_ids"].shape[1]
    return tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
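
The function above only assumes that the retriever exposes `retrieve(question, k)` and returns a list of strings. To exercise the prompt-building step without a vector database or a GPU, a toy stand-in can be sketched as follows. `KeywordRetriever` and `build_prompt` are hypothetical names, and word overlap is only a placeholder for real embedding search.

```python
class KeywordRetriever:
    """Toy retriever: ranks documents by word overlap with the question."""

    def __init__(self, documents: list[str]):
        self.documents = documents

    def retrieve(self, question: str, k: int = 5) -> list[str]:
        q_words = set(question.lower().split())
        scored = [
            (len(q_words & set(doc.lower().split())), doc)
            for doc in self.documents
        ]
        # Highest overlap first; drop documents with no shared words
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for score, doc in scored[:k] if score > 0]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Same prompt template as the reader function above."""
    context = "\n".join(context_chunks)
    return f"### Context:\n{context}\n\n### Question:\n{question}\n\n### Answer:"


docs = [
    "Refunds are processed within 14 days of the return request.",
    "Our office is closed on public holidays.",
    "Return requests must include the original order number.",
]
retriever = KeywordRetriever(docs)
chunks = retriever.retrieve("how long do refunds take after a return request", k=2)
prompt = build_prompt("How long do refunds take?", chunks)
print(prompt)
```

Because the interface matches, an instance of this class could be passed as `retriever` while testing, then swapped for a real retriever later.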

Default to prompting, add RAG for knowledge, fine-tune for behavior

This ordering minimizes wasted effort: each earlier technique is cheaper and faster to implement than the one after it. Move to the next technique only when the current one is genuinely insufficient.


Red flags: when fine-tuning will fail

Don't fine-tune to fix these

  • Insufficient data: Fewer than 50 examples for classification is almost never enough
  • Noisy labels: Inconsistent labeling (even 10% noise) degrades model quality significantly
  • Task too broad: "Be better at customer support" is not a trainable task; "classify ticket category" is
  • Moving target: If the task definition changes frequently, fine-tuning creates maintenance debt
  • Hallucination prevention: Fine-tuning does not reliably prevent hallucination; use RAG + grounding prompts
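
The first two red flags can be checked mechanically before any training run. The sketch below counts examples per label and flags identical inputs that carry conflicting labels, which gives a cheap lower bound on label noise. The function name `preflight_report`, the default threshold of 50, and the sample data are illustrative assumptions.

```python
from collections import Counter, defaultdict


def preflight_report(examples: list[tuple[str, str]], min_examples: int = 50) -> dict:
    """Basic sanity checks on (text, label) pairs before fine-tuning."""
    label_counts = Counter(label for _, label in examples)

    # Group labels by normalized text to find duplicate inputs with different labels
    labels_by_text = defaultdict(set)
    for text, label in examples:
        labels_by_text[text.strip().lower()].add(label)
    conflicts = [t for t, labels in labels_by_text.items() if len(labels) > 1]

    return {
        "total": len(examples),
        "enough_data": len(examples) >= min_examples,
        "label_counts": dict(label_counts),
        "conflicting_inputs": conflicts,
    }


data = [
    ("Reset my password", "account"),
    ("reset my password", "billing"),   # same input, different label
    ("Where is my invoice?", "billing"),
]
report = preflight_report(data)
print(report)
```

Exact-match conflicts only catch the most obvious inconsistencies; paraphrased inputs with conflicting labels still need a manual review pass.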
