Zero-Shot and Few-Shot Prompting
The first question when building any LLM feature is: how much does the model already know? Zero-shot prompting measures that baseline. Few-shot prompting gives the model examples to calibrate from. Both are prompt-only changes, so they are typically the cheapest and fastest techniques to try first.
Learning objectives
- Write effective zero-shot prompts using clarity, specificity, and output format instructions
- Design few-shot examples that generalize rather than overfit
- Know when to add examples and when they're unnecessary overhead
- Measure whether examples actually help using a quick evaluation
Zero-shot prompting
Zero-shot means no examples — just the instruction. Modern frontier models (GPT-4o, Claude Sonnet 4.6) handle a remarkable range of tasks zero-shot.
The key to effective zero-shot prompting is specificity:
import openai
client = openai.OpenAI()
def prompt(user_message: str, system: str = "", temperature: float = 0.0) -> str:
    """Send a single-turn chat request and return the model's text reply."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        temperature=temperature,
        max_tokens=512,
    )
    return response.choices[0].message.content
# BAD — vague instruction
bad = prompt("Tell me about the customer review: 'Great product, fast shipping!'")
print("Bad:", bad)
# Could produce anything: summary, analysis, praise, critique...
# GOOD — specific instruction with output format
good = prompt("""Classify the sentiment of this customer review as POSITIVE, NEGATIVE, or NEUTRAL.
Respond with only the label, nothing else.
Review: 'Great product, fast shipping!'""")
print("Good:", good)
# Output: POSITIVE
The specificity ladder
Vague → precise prompts:
| Version | Prompt | Problem |
|---|---|---|
| Vague | "Summarize this email." | Unclear length, format, audience |
| Better | "Summarize this email in 2–3 sentences." | Length specified |
| Good | "Summarize this email in 2–3 sentences for a busy executive who needs to know: action required, deadline, and key stakeholders." | Purpose + content requirements specified |
| Excellent | Add: "Use plain language. No jargon. Start with the action." | Style constraints added |
Anatomy of a strong zero-shot prompt
def classify_support_ticket(ticket_text: str) -> str:
    """
    Zero-shot classifier using all five components of a strong prompt.
    """
    prompt_text = f"""You are a customer support routing system.
## Task
Classify the support ticket below into exactly one category.
## Categories
- BILLING: Payment issues, refunds, invoices, subscription changes
- TECHNICAL: Bugs, errors, crashes, performance issues, integrations
- ACCOUNT: Login, password, permissions, account settings
- FEATURE_REQUEST: Suggestions for new features or improvements
- OTHER: Anything that doesn't fit the above categories
## Rules
- Respond with ONLY the category label (e.g., BILLING)
- Do not explain your reasoning
- If multiple categories apply, pick the most specific one
## Ticket
{ticket_text}
## Category"""
    return prompt(prompt_text)
# Test
tickets = [
"I was charged twice for my subscription last month",
"The export to CSV button throws a 500 error",
"Can you add dark mode to the dashboard?",
"I can't log in after resetting my password",
]
for ticket in tickets:
    label = classify_support_ticket(ticket)
    print(f" [{label}] {ticket[:60]}")
The five components of a strong zero-shot prompt:
1. Role / context — who the model is
2. Task — what to do, precisely
3. Constraints — format, length, style, what NOT to do
4. Input — the actual content to process
5. Output anchor — start the output structure (e.g., ## Category)
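The five components can also be assembled programmatically, which helps keep prompts for different tasks consistent. The following is a minimal sketch; `build_zero_shot_prompt` and its parameter names are hypothetical helpers introduced here for illustration, not part of any library.

```python
def build_zero_shot_prompt(
    role: str,
    task: str,
    constraints: list[str],
    input_text: str,
    output_anchor: str,
) -> str:
    """Assemble a zero-shot prompt from the five components (hypothetical helper)."""
    rules = "\n".join(f"- {rule}" for rule in constraints)
    return (
        f"{role}\n"                  # 1. Role / context
        f"## Task\n{task}\n"         # 2. Task
        f"## Rules\n{rules}\n"       # 3. Constraints
        f"## Input\n{input_text}\n"  # 4. Input
        f"## {output_anchor}"        # 5. Output anchor, left open for the model
    )


example_prompt = build_zero_shot_prompt(
    role="You are a customer support routing system.",
    task="Classify the support ticket below into exactly one category.",
    constraints=["Respond with ONLY the category label", "Do not explain your reasoning"],
    input_text="I was charged twice for my subscription last month",
    output_anchor="Category",
)
print(example_prompt)
```

Because the anchor is always the final line, the model's reply starts exactly where the label is expected.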
Few-shot prompting
Few-shot provides worked examples inside the prompt. The model uses them as templates to calibrate format, tone, and reasoning style.
import anthropic
client = anthropic.Anthropic()
def extract_structured_data(text: str) -> str:
    """
    Few-shot extraction: examples teach the output format.
    """
    system = "You extract structured information from text. Return only the requested data."
    few_shot_prompt = f"""Extract the person's name, company, and role from the text.
Format: Name | Company | Role
Text: "Hi, I'm Sarah Chen, Senior Product Manager at DataFlow Inc."
Extract: Sarah Chen | DataFlow Inc. | Senior Product Manager
Text: "This is Mike Rodriguez from the engineering team at CloudBase."
Extract: Mike Rodriguez | CloudBase | Unknown
Text: "Dr. Emily Watson, Chief Medical Officer, HealthPath Systems."
Extract: Emily Watson | HealthPath Systems | Chief Medical Officer
Text: "{text}"
Extract:"""
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=100,
        system=system,
        messages=[{"role": "user", "content": few_shot_prompt}],
    )
    return response.content[0].text.strip()
test_cases = [
"I'm Alex Kim, the lead data scientist at NeuralLabs.",
"Please connect me with James Thompson, he's a partner at Summit Ventures.",
"Our CTO, Lisa Park from Quantix, will join the call.",
]
for text in test_cases:
    result = extract_structured_data(text)
    print(f"Input: {text}")
    print(f"Output: {result}\n")
Writing good few-shot examples
Bad examples can hurt more than having no examples at all. Follow these rules:
Rule 1: Cover the edge cases
# BAD: all easy cases, so the model won't know how to handle edges
bad_examples = [
    ("I love this product!", "POSITIVE"),
    ("This is terrible.", "NEGATIVE"),
    ("It's okay.", "NEUTRAL"),
]
# GOOD: include the hard cases
good_examples = [
    ("I love this product!", "POSITIVE"),
    ("This is terrible.", "NEGATIVE"),
    ("It's okay.", "NEUTRAL"),
    ("The product is great but shipping took forever.", "MIXED"),  # edge case: requires MIXED in the label set
    ("I returned it.", "NEUTRAL"),  # no sentiment expressed
    ("Don't buy this. Seriously.", "NEGATIVE"),  # emphatic warning with no explicit sentiment word
]
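One quick way to sanity-check Rule 1 is to confirm that the example set demonstrates every label the prompt allows. A minimal sketch, assuming a four-label scheme that includes MIXED; `missing_labels` is a hypothetical helper, not a library function.

```python
def missing_labels(examples: list[tuple[str, str]], allowed: set[str]) -> set[str]:
    """Return allowed labels that no example demonstrates (hypothetical helper)."""
    used = {label for _, label in examples}
    unknown = used - allowed
    if unknown:
        # A label outside the allowed set usually means a typo in the examples
        raise ValueError(f"Examples use labels outside the allowed set: {sorted(unknown)}")
    return allowed - used


ALLOWED = {"POSITIVE", "NEGATIVE", "NEUTRAL", "MIXED"}  # assumed label set

easy_only = [
    ("I love this product!", "POSITIVE"),
    ("This is terrible.", "NEGATIVE"),
    ("It's okay.", "NEUTRAL"),
]
with_edges = easy_only + [
    ("The product is great but shipping took forever.", "MIXED"),
]

print(missing_labels(easy_only, ALLOWED))   # {'MIXED'}
print(missing_labels(with_edges, ALLOWED))  # set()
```

Label coverage is only a lower bar: it does not show that the hard phrasings within each label are represented, which still needs manual review.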
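Rule 2: Match the distribution of real inputs
Choose examples whose topics, lengths, and label mix resemble what the model will see in production. A set skewed toward one label can bias predictions toward that label.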
def build_few_shot_prompt(examples: list[tuple[str, str]], new_input: str) -> str:
    """Build a few-shot prompt from (input, output) pairs."""
    lines = []
    for inp, out in examples:
        lines.append(f"Input: {inp}")
        lines.append(f"Output: {out}")
        lines.append("")  # blank line separator
    # End with the new input and an open "Output:" for the model to complete
    lines.append(f"Input: {new_input}")
    lines.append("Output:")
    return "\n".join(lines)
# How many examples: use 3–8. Returns usually diminish after about 8 for most tasks.
# Example quality matters far more than example quantity.
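Matching the input distribution can be approximated by sampling examples per label in proportion to how often each label appears in real traffic. This is a sketch under stated assumptions: `label_share` is a hypothetical dictionary of observed label frequencies, and `sample_by_distribution` is an illustrative helper, not a library function.

```python
import random
from collections import defaultdict


def sample_by_distribution(
    pool: list[tuple[str, str]],
    label_share: dict[str, float],
    k: int,
    seed: int = 0,
) -> list[tuple[str, str]]:
    """Pick about k examples whose label mix follows label_share (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed keeps the prompt reproducible
    by_label: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for text, label in pool:
        by_label[label].append((text, label))

    chosen = []
    for label, share in label_share.items():
        # At least one example per label, capped by what the pool contains
        want = max(1, round(share * k))
        available = by_label.get(label, [])
        chosen.extend(rng.sample(available, min(want, len(available))))
    rng.shuffle(chosen)  # avoid grouping all examples of one label together
    return chosen


pool = [
    ("I was charged twice", "BILLING"),
    ("Refund has not arrived", "BILLING"),
    ("Invoice shows the wrong amount", "BILLING"),
    ("Export throws a 500 error", "TECHNICAL"),
    ("App crashes on startup", "TECHNICAL"),
    ("I can't log in", "ACCOUNT"),
]
label_share = {"BILLING": 0.5, "TECHNICAL": 0.25, "ACCOUNT": 0.25}  # assumed traffic mix
picked = sample_by_distribution(pool, label_share, k=4)
for text, label in picked:
    print(label, "|", text)
```

Guaranteeing one example per label means the result can exceed `k` slightly when there are many rare labels; whether that trade-off is acceptable depends on the token budget.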
Rule 3: Put the most relevant example last
The final example before the new input tends to have the strongest priming effect, because models weight recent context heavily. Put your best or most representative example there.
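Ordering can be automated once there is some relevance score between an example and the new input. The sketch below sorts examples in ascending relevance so the best match lands last; `order_most_relevant_last` is a hypothetical helper, and `word_overlap` is a deliberately crude stand-in for a real similarity measure such as embedding similarity.

```python
from typing import Callable


def order_most_relevant_last(
    examples: list[tuple[str, str]],
    new_input: str,
    relevance: Callable[[str, str], float],
) -> list[tuple[str, str]]:
    """Sort examples so the one most relevant to new_input comes last."""
    return sorted(examples, key=lambda ex: relevance(ex[0], new_input))


def word_overlap(a: str, b: str) -> float:
    """Toy relevance score: Jaccard overlap of lowercase word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


examples = [
    ("I was charged twice this month", "BILLING"),
    ("The app crashes on startup", "TECHNICAL"),
    ("I can't log in", "ACCOUNT"),
]
ordered = order_most_relevant_last(
    examples, "I was charged twice for my subscription", word_overlap
)
print([label for _, label in ordered])  # ['TECHNICAL', 'ACCOUNT', 'BILLING']
```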
Zero-shot vs few-shot: when to use which
| Situation | Use |
|---|---|
| Standard NLP task (sentiment, classification) | Zero-shot — frontier models are pre-trained on these |
| Custom output format (your company's JSON schema) | Few-shot — the model hasn't seen your format |
| Complex multi-field extraction | Few-shot — 2–3 examples clarify expectations |
| Creative generation with specific style | Few-shot — 1–2 samples of your desired style |
| Simple Q&A, summarization | Zero-shot — examples rarely help |
| Task with many edge cases | Few-shot with curated edge case examples |
Calibration test
Add one example at a time and measure output quality. If adding the third example doesn't improve results, stop. Each example costs tokens.
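The calibration test can be scripted as a small loop that scores the prompt with 0, 1, 2, and more examples against a labeled evaluation set. This is a sketch: `accuracy_by_shot_count` is a hypothetical helper, `classify` stands in for whatever function sends a prompt to your model and returns a label, and the prompt layout assumes a simple Input/Output format.

```python
from typing import Callable


def accuracy_by_shot_count(
    examples: list[tuple[str, str]],
    eval_set: list[tuple[str, str]],
    classify: Callable[[str], str],
) -> dict[int, float]:
    """Return accuracy on eval_set using the first n examples, for n = 0..len(examples)."""
    results = {}
    for n in range(len(examples) + 1):
        # Build the few-shot prefix from the first n examples (empty when n == 0)
        shots = "".join(f"Input: {inp}\nOutput: {out}\n\n" for inp, out in examples[:n])
        correct = 0
        for text, expected in eval_set:
            prompt_text = f"{shots}Input: {text}\nOutput:"
            if classify(prompt_text).strip() == expected:
                correct += 1
        results[n] = correct / len(eval_set)
    return results
```

Keep the evaluation set separate from the examples in the prompt, otherwise the measured accuracy overstates how well the prompt generalizes. Stop adding examples at the first `n` where accuracy no longer improves.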
Dynamic few-shot selection
Static examples embedded in a prompt are easy. Dynamic selection — retrieving the most relevant examples for each input — is more powerful and is covered in the RAG section.
from openai import OpenAI
import numpy as np
client = OpenAI()
# Embed examples and query to find most similar ones
def get_embedding(text: str) -> list[float]:
    """Return the embedding vector for a single text."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_examples(
    example_pool: list[tuple[str, str]],
    query: str,
    k: int = 3,
) -> list[tuple[str, str]]:
    """Return the k most similar examples to the query, most similar first."""
    query_emb = get_embedding(query)
    # One embedding call per pooled example: fine for a demo, slow at scale
    scored = [
        (cosine_similarity(query_emb, get_embedding(inp)), inp, out)
        for inp, out in example_pool
    ]
    # Sort on the similarity score only, highest first
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(inp, out) for _, inp, out in scored[:k]]
# In practice: pre-compute and store example embeddings in a vector DB
# See: Week-01/Day-03-Part-2-Vector-Databases
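The pre-computation mentioned in the comment above can be sketched without a vector database by embedding the pool once and keeping the normalized vectors in memory. `ExampleIndex` is a hypothetical class, and `embed` is any function that maps text to a vector (for instance `get_embedding` above); a toy embedding is used here so that the sketch runs offline.

```python
import math
from typing import Callable


def _normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


class ExampleIndex:
    """In-memory index of few-shot examples (hypothetical helper, not a vector DB)."""

    def __init__(self, pool: list[tuple[str, str]], embed: Callable[[str], list[float]]):
        self.pool = pool
        self.embed = embed
        # Embed every example once, at build time, instead of once per query
        self.vectors = [_normalize(embed(inp)) for inp, _ in pool]

    def select(self, query: str, k: int = 3) -> list[tuple[str, str]]:
        """Return the k examples most similar to the query, most similar first."""
        q = _normalize(self.embed(query))
        scores = [sum(a * b for a, b in zip(vec, q)) for vec in self.vectors]
        ranked = sorted(range(len(self.pool)), key=lambda i: scores[i], reverse=True)
        return [self.pool[i] for i in ranked[:k]]


def toy_embed(text: str) -> list[float]:
    """Toy stand-in for an embedding model: a bias term plus keyword counts."""
    words = text.lower().split()
    return [1.0] + [float(words.count(w)) for w in ("charged", "refund", "error", "login")]


pool = [
    ("I was charged twice", "BILLING"),
    ("Export throws an error", "TECHNICAL"),
    ("login fails after reset", "ACCOUNT"),
]
index = ExampleIndex(pool, toy_embed)
print(index.select("Why was I charged again?", k=1))  # [('I was charged twice', 'BILLING')]
```

With this layout each query costs one embedding call rather than one per pooled example; a vector database adds persistence and approximate search once the pool grows large.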
Common mistakes
Inconsistent examples confuse the model
If your examples use different formats — sometimes JSON, sometimes plain text — the model will pick one unpredictably. Use exactly one format across all examples.
Example labels that are wrong
A mislabeled example is worse than no example. The model will learn from it. Curate your examples carefully — treat them like training data.
Too many examples = hidden context cost
10 examples × 200 tokens each = 2,000 extra prompt tokens per call. For a high-volume classification task making 100,000 calls/day, that's 200M extra input tokens per day, or about $500/day at GPT-4o input pricing of $2.50 per million tokens.
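The same arithmetic is easy to rerun for your own volumes and prices. A minimal sketch: `extra_prompt_cost_per_day` is a hypothetical helper, and the per-token price is passed in as a parameter because provider pricing changes over time.

```python
def extra_prompt_cost_per_day(
    num_examples: int,
    tokens_per_example: int,
    calls_per_day: int,
    usd_per_million_input_tokens: float,
) -> float:
    """Daily cost in USD of the input tokens added by few-shot examples."""
    extra_tokens = num_examples * tokens_per_example * calls_per_day
    return extra_tokens / 1_000_000 * usd_per_million_input_tokens


# Figures from the text; 2.50 USD per million tokens is the rate implied by $500/day
cost = extra_prompt_cost_per_day(10, 200, 100_000, 2.50)
print(f"${cost:,.2f} per day")  # $500.00 per day
```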
Key takeaway
Start with zero-shot. Add examples only when zero-shot fails, and test empirically that they help. The quality of your examples matters more than the quantity.