How LLMs Generate Text¶
LLMs don't "think of" the next word. They compute a probability distribution over the entire vocabulary and then sample from it. The sampling strategy you choose shapes everything from creativity to factual accuracy.
Learning objectives¶
- Explain autoregressive generation and why it's sequential
- Compare greedy, temperature, top-k, and top-p sampling
- Set temperature, top_p, and max_tokens correctly for a given task
- Explain why reasoning models (o3, o4-mini) behave differently
Autoregressive generation¶
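LLMs generate text one token at a time, left to right. Each new token is conditioned on all previous tokens: the model computes P(token_t | token_1, ..., token_{t-1}), picks a token from that distribution, appends it to the sequence, and repeats.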
The model cannot "go back" and change an earlier token. This is why getting the model to reason before answering — via chain-of-thought — improves output quality: it commits earlier tokens to a reasoning path that constrains and improves later tokens.
```python
import openai

client = openai.OpenAI()

# Each API call returns one response, which the model produces internally
# one token at a time via autoregressive decoding.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "The capital of France is"}],
    max_tokens=5,
    temperature=0,  # near-greedy: favors the highest-probability token
)
print(response.choices[0].message.content)
# Typical output: "Paris." (near-deterministic with temperature=0)
```
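To make the token-by-token loop explicit, here is a minimal sketch of autoregressive decoding. The `NEXT_TOKEN_PROBS` table and the `generate` helper are hypothetical stand-ins for a real model: an actual LLM conditions on the entire prefix, whereas this toy table only looks at the last token.

```python
import random

# Hypothetical toy "model": next-token probabilities keyed by the last token.
# A real LLM conditions on the whole prefix, not just the previous token.
NEXT_TOKEN_PROBS = {
    "<s>": {"The": 1.0},
    "The": {"capital": 0.6, "city": 0.4},
    "capital": {"of": 1.0},
    "city": {"of": 1.0},
    "of": {"France": 0.7, "Spain": 0.3},
    "France": {"is": 1.0},
    "Spain": {"is": 1.0},
    "is": {"Paris": 0.5, "Madrid": 0.5},
    "Paris": {"</s>": 1.0},
    "Madrid": {"</s>": 1.0},
}

def generate(max_tokens: int, seed: int = 0) -> list:
    """Sample one token at a time until </s> or the token budget is reached."""
    rng = random.Random(seed)
    tokens = ["<s>"]
    for _ in range(max_tokens):
        dist = NEXT_TOKEN_PROBS[tokens[-1]]
        next_token = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        if next_token == "</s>":
            break
        tokens.append(next_token)  # committed: earlier tokens are never revised
    return tokens[1:]

print(" ".join(generate(max_tokens=10)))
```

Note that the toy table happily produces "The capital of France is Madrid": once "France" has been emitted, nothing in this simplified model ties the final token back to it, which is exactly the dependency a real model's full-prefix conditioning provides.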
The logits-to-token pipeline¶
After the transformer's final layer, the model produces a logit vector — one raw score per vocabulary token (~200,000 scores for GPT-4o):
[final transformer layer output]
↓
[Linear projection] → 200,000 raw logit scores
↓
[Apply temperature] → scale logits
↓
[Apply top-k / top-p filtering] → mask low-probability tokens (set their logits to -inf)
↓
[Softmax] → probability distribution
↓
[Sample] → one token index
↓
[Decode] → "Paris"
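The pipeline above can be sketched end to end in a few lines of plain Python. This is an illustrative sketch, not how any particular inference engine is implemented: `sample_next_token`, `VOCAB`, and the logit values are made up for the example, and top-p filtering is left out for brevity.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, seed=None):
    """Logits -> temperature -> top-k mask -> softmax -> sample one index."""
    if temperature <= 0:
        # Treat T=0 as greedy decoding: return the argmax directly.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [z / temperature for z in logits]
    if top_k is not None:
        # Keep only logits at or above the k-th largest value (top_k >= 1).
        cutoff = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [z if z >= cutoff else float("-inf") for z in scaled]
    peak = max(scaled)
    weights = [math.exp(z - peak) for z in scaled]  # masked tokens get weight 0.0
    total = sum(weights)
    probs = [w / total for w in weights]
    rng = random.Random(seed)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

VOCAB = ["Paris", "Lyon", "London", "banana"]  # hypothetical 4-token vocabulary
logits = [5.0, 2.0, 1.0, -3.0]                 # made-up scores

print(VOCAB[sample_next_token(logits, temperature=0)])
print(VOCAB[sample_next_token(logits, temperature=1.5, top_k=2, seed=42)])
```

Subtracting the largest logit before exponentiating does not change the resulting probabilities; it only avoids overflow for large scores.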
Greedy decoding¶
Always pick the highest-probability token. Deterministic but boring.
```python
import torch
import torch.nn.functional as F

def greedy_decode(logits: torch.Tensor) -> int:
    """Return the index of the highest-scoring token."""
    return logits.argmax().item()

# The problem with greedy decoding: it can get stuck in repetitive loops, e.g.
# "The cat sat on the mat. The cat sat on the mat. The cat..."
# Use temperature=0 in APIs for near-greedy behavior.
```
When to use: Extraction tasks, structured output, or when you need determinism in tests.
Temperature¶
Temperature scales the logits before softmax, controlling how peaked or flat the distribution is.
```python
import torch
import torch.nn.functional as F

def apply_temperature(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature == 0:
        # Effectively greedy: return a one-hot distribution
        probs = torch.zeros_like(logits)
        probs[logits.argmax()] = 1.0
        return probs
    return F.softmax(logits / temperature, dim=-1)

# Demonstration
logits = torch.tensor([3.0, 1.5, 0.5, 0.2])  # 4-token vocab
for temp in [0.1, 0.5, 1.0, 1.5, 2.0]:
    probs = apply_temperature(logits, temp)
    print(f"T={temp}: {probs.tolist()}")
# Approximate values, rounded to two decimals:
# T=0.1: [1.00, 0.00, 0.00, 0.00] ← very peaked, near-deterministic
# T=1.0: [0.73, 0.16, 0.06, 0.04] ← original distribution
# T=2.0: [0.50, 0.24, 0.14, 0.12] ← much flatter, more random
```
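Written as a formula, the scaling is the standard temperature softmax. The symbols are introduced here for illustration: $z_i$ is the logit of token $i$ and $T$ is the temperature. The second identity shows that the odds between any two tokens depend only on their logit gap divided by $T$.

```latex
p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)},
\qquad
\frac{p_a}{p_b} = \exp\!\left(\frac{z_a - z_b}{T}\right)
```

For the two strongest logits in the demonstration (3.0 and 1.5, a gap of 1.5), the top token is favored by a factor of about $e^{15} \approx 3.3 \times 10^{6}$ at $T = 0.1$, about $e^{1.5} \approx 4.5$ at $T = 1$, and about $e^{0.75} \approx 2.1$ at $T = 2$. As $T \to 0$ the distribution collapses onto the argmax; as $T \to \infty$ it approaches uniform.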
Temperature guide by task:
| Task | Recommended Temperature |
|---|---|
| Fact extraction, classification | 0.0 – 0.2 |
| Summarization, Q&A | 0.3 – 0.5 |
| Code generation | 0.2 – 0.4 |
| Writing, brainstorming | 0.7 – 1.0 |
| Creative fiction, poetry | 1.0 – 1.2 |
| Any task | Avoid values above 1.5; output degrades rapidly |
Top-k sampling¶
Restrict sampling to only the top k tokens. Prevents the model from ever picking a very improbable token.
```python
import torch
import torch.nn.functional as F

def top_k_filter(logits: torch.Tensor, k: int) -> torch.Tensor:
    """Mask all logits except the top k, then renormalize with softmax."""
    top_k_vals, _ = torch.topk(logits, k)
    threshold = top_k_vals[-1]  # smallest logit that survives the filter
    filtered = logits.clone()
    filtered[filtered < threshold] = float("-inf")
    return F.softmax(filtered, dim=-1)

logits = torch.tensor([3.0, 1.5, 0.5, 0.2, -0.5, -2.0])
probs_k3 = top_k_filter(logits, k=3)
print(probs_k3)  # Only the top 3 tokens have non-zero probability
```
Problem with top-k: a fixed k (say, k=50) is too wide when the distribution is sharp (one obvious token) and too narrow when it is flat (many reasonable tokens).
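A quick way to see this mismatch is to measure how much probability mass the top k tokens hold in two extreme cases. The sketch below uses made-up logits over a hypothetical 100-token vocabulary; `softmax` and `top_k_mass` are helper names chosen for this example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_mass(logits, k):
    """Probability mass (before renormalizing) held by the k most likely tokens."""
    return sum(sorted(softmax(logits), reverse=True)[:k])

sharp = [10.0] + [0.0] * 99  # one obvious token, 99 unlikely ones
flat = [0.0] * 100           # 100 equally plausible tokens

print(f"sharp: top-1 alone holds {top_k_mass(sharp, 1):.3f} of the mass")
print(f"flat:  top-50 hold only {top_k_mass(flat, 50):.3f} of the mass")
```

In the sharp case a single token already carries more than 99% of the mass, so k=50 admits 49 tokens that are almost never appropriate. In the flat case k=50 discards half of the equally reasonable candidates.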
Top-p (nucleus) sampling¶
Instead of a fixed count k, take the smallest set of tokens whose cumulative probability exceeds p. Adapts to the distribution shape.
```python
import torch
import torch.nn.functional as F

def top_p_filter(logits: torch.Tensor, p: float) -> torch.Tensor:
    """Keep the smallest set of top tokens whose cumulative probability reaches p.

    Expects a 1-D logits tensor.
    """
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
    # Mark tokens past the threshold for removal, shifting the mask right by one
    # so that the first token that crosses the threshold is kept.
    sorted_indices_to_remove = cumulative_probs > p
    sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
    sorted_indices_to_remove[..., 0] = False
    # Map the mask from sorted order back to vocabulary order.
    indices_to_remove = sorted_indices_to_remove.scatter(
        0, sorted_indices, sorted_indices_to_remove
    )
    filtered_logits = logits.masked_fill(indices_to_remove, float("-inf"))
    return F.softmax(filtered_logits, dim=-1)

# When the model is very confident, top_p=0.9 may keep just 1-2 tokens.
# When the model is uncertain, top_p=0.9 may keep 10-20 tokens or more.
# This adaptive behavior is why top-p is often preferred over top-k in practice.
```
In practice: set top_p=1.0 (effectively disabled) and tune only temperature, or set temperature=1.0 and tune only top_p. Changing both at once makes their effects hard to separate and often causes unexpected interactions.
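The adaptive behavior of top-p can be checked by counting how many tokens end up in the nucleus for a confident versus an uncertain distribution. This is a small stdlib-only sketch with made-up logits; `nucleus_size` is a hypothetical helper, not part of any library.

```python
import math

def nucleus_size(logits, p):
    """Number of tokens in the smallest set whose cumulative probability reaches p."""
    peak = max(logits)
    exps = sorted((math.exp(z - peak) for z in logits), reverse=True)
    total = sum(exps)
    cumulative = 0.0
    for count, e in enumerate(exps, start=1):
        cumulative += e / total
        if cumulative >= p:
            return count
    return len(exps)  # guard against floating-point shortfall

confident = [8.0, 1.0, 0.5, 0.0, -1.0, -2.0]  # one dominant token
uncertain = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5]    # nearly uniform

print("confident:", nucleus_size(confident, p=0.9))
print("uncertain:", nucleus_size(uncertain, p=0.9))
```

With the same p=0.9, the confident distribution keeps a single token while the near-uniform one keeps all six, which is the adjustment a fixed k cannot make.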
Using sampling parameters in the API¶
```python
import anthropic
import openai

openai_client = openai.OpenAI()
anthropic_client = anthropic.Anthropic()

# Creative writing: high temperature
creative_response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write the opening line of a noir novel."}],
    temperature=1.0,
    max_tokens=100,
    # top_p=1.0,  # leave at default when using temperature
)

# Data extraction: near-greedy
extraction_response = anthropic_client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=256,
    temperature=0.0,  # near-deterministic
    messages=[{
        "role": "user",
        "content": 'Extract the invoice number from: "Invoice #INV-2024-8821 dated March 15"',
    }],
)

print("Creative:", creative_response.choices[0].message.content)
print("Extraction:", extraction_response.content[0].text)
```
Stop sequences¶
Stop sequences tell the model to stop generating when it produces a specific string. Crucial for structured output and preventing runaway responses.
```python
# Stop before the model starts a new "User:" turn in simulated dialogue
response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Simulate a support conversation."}],
    stop=["User:", "Human:", "\n\n---"],
    max_tokens=500,
)

# Stop before the JSON object closes
response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": 'Return JSON: {"name": "Alice", "age":'}],
    stop=["}"],
    max_tokens=20,
)
# Generation halts at the first "}", which is not included in the returned text.
# If the model simply continues the fragment, the output is something like ' 30'.
```
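The effect of a stop sequence on the returned text can be mimicked client-side, which is handy for testing or for models served without a stop parameter. The `truncate_at_stop` helper below is a hypothetical illustration; the real API applies the stop server-side and halts generation, so no further tokens are produced or billed.

```python
def truncate_at_stop(text: str, stop_sequences: list) -> str:
    """Cut text at the earliest occurrence of any stop sequence (sequence excluded)."""
    cut = len(text)
    for stop in stop_sequences:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]

raw = "Agent: How can I help?\nUser: My order is late."
print(repr(truncate_at_stop(raw, ["User:", "Human:"])))
print(repr(truncate_at_stop(" 30}", ["}"])))
```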
max_tokens is not optional
Always set max_tokens. Without it, the model can keep generating up to its maximum output length, which costs money and often yields a rambling response. (The Anthropic Messages API requires max_tokens on every request.) For most tasks: 256–2048. For summaries of long docs: 1024–4096.
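A cap that is too low cuts responses off mid-sentence, so it helps to check why generation stopped. The sketch below assumes, based on my understanding of the current APIs, that OpenAI chat completions report `finish_reason == "length"` and Anthropic messages report `stop_reason == "max_tokens"` when the limit is hit; verify both against the provider documentation. `was_truncated` is a hypothetical helper.

```python
from typing import Optional

# Assumed values signalling that the token limit was reached:
# "length" (OpenAI finish_reason) and "max_tokens" (Anthropic stop_reason).
TRUNCATION_REASONS = {"length", "max_tokens"}

def was_truncated(reason: Optional[str]) -> bool:
    """Return True if the provider reported hitting the output token limit."""
    return reason in TRUNCATION_REASONS

# Usage with real responses (not executed here):
#   was_truncated(openai_response.choices[0].finish_reason)
#   was_truncated(anthropic_response.stop_reason)
print(was_truncated("length"), was_truncated("stop"))
```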
Reasoning models: a different paradigm¶
OpenAI's o-series (o1, o3, o4-mini) and Anthropic's extended thinking mode work differently. They generate a hidden chain-of-thought before producing the visible answer.
```python
# Reuses openai_client and anthropic_client from the earlier example.

# o4-mini uses a "reasoning effort" parameter instead of temperature
response = openai_client.chat.completions.create(
    model="o4-mini",
    messages=[{"role": "user", "content": "What is 17 × 23 + sqrt(289)?"}],
    reasoning_effort="high",     # "low", "medium", or "high"
    max_completion_tokens=2000,  # includes reasoning tokens
)
# Note: temperature and top_p are NOT used with o-series models
print(response.choices[0].message.content)

# Anthropic extended thinking
response = anthropic_client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000,  # how many tokens the model can spend reasoning
    },
    messages=[{"role": "user", "content": "Solve: If a snail crawls 3cm/hour..."}],
)
for block in response.content:
    if block.type == "thinking":
        print("REASONING:", block.thinking[:200], "...")
    elif block.type == "text":
        # Other block types (e.g. redacted thinking) have no .text attribute
        print("ANSWER:", block.text)
```
When to use reasoning models
Use o3/o4-mini or extended thinking for multi-step math, complex code generation, and tasks where being wrong is expensive. Avoid them for simple Q&A, quick classification, and latency-sensitive tasks: they are typically much slower and more expensive (roughly 5–20×, depending on the task and the reasoning budget).
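Because reasoning models take different parameters, code that switches between model families needs to build its request accordingly. The following is a rough sketch: `build_params` and the prefix check are assumptions made for illustration (the prefix list mirrors the o-series names mentioned above and is not an official way to detect reasoning models).

```python
# Assumption: o-series reasoning models can be recognized by these name prefixes.
REASONING_PREFIXES = ("o1", "o3", "o4")

def build_params(model: str, max_output_tokens: int, temperature: float = 0.0,
                 reasoning_effort: str = "medium") -> dict:
    """Build keyword arguments for chat.completions.create for either model family."""
    params = {"model": model}
    if model.startswith(REASONING_PREFIXES):
        # Reasoning models: no sampling knobs; the budget covers reasoning tokens too.
        params["reasoning_effort"] = reasoning_effort
        params["max_completion_tokens"] = max_output_tokens
    else:
        params["temperature"] = temperature
        params["max_tokens"] = max_output_tokens
    return params

print(build_params("o4-mini", 2000, reasoning_effort="high"))
print(build_params("gpt-4o", 256, temperature=0.2))
```

The resulting dictionary could then be unpacked into a call such as `openai_client.chat.completions.create(messages=..., **params)`.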
Beam search (not used in modern LLM APIs)¶
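Beam search maintains the top B candidate sequences simultaneously and returns the highest-scoring one. It was standard in translation models, but the APIs for GPT-4o, Claude, and other chat models expose sampling controls rather than beam search. For open-ended generation it has several drawbacks:
- Tends to produce repetitive, generic output
- Computationally expensive at scale (B candidate sequences per step)
- Awkward to stream, because the leading candidate can change until the search finishes
Beam search still appears in specialized models (like Whisper for speech-to-text). For chat and open-ended generation, sampling is the default choice.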
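For readers who have not seen it, beam search can be sketched in a few lines. The `NEXT` table below is a hypothetical toy model keyed only on the previous token, chosen so that greedy decoding (beam width 1) and a wider beam disagree.

```python
import math

# Hypothetical toy model: next-token probabilities keyed by the previous token.
NEXT = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "w": 0.1},
}

def beam_search(beam_width: int, steps: int) -> list:
    """Return the highest-scoring token sequence found with the given beam width."""
    beams = [(0.0, ["<s>"])]  # (cumulative log-probability, tokens so far)
    for _ in range(steps):
        candidates = []
        for score, tokens in beams:
            for token, prob in NEXT.get(tokens[-1], {}).items():
                candidates.append((score + math.log(prob), tokens + [token]))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]  # prune to the best B sequences
    return beams[0][1][1:]

print("beam width 1:", beam_search(beam_width=1, steps=2))
print("beam width 2:", beam_search(beam_width=2, steps=2))
```

With a width of 1 the search commits to "a" (probability 0.6) and ends with a sequence of probability 0.3, while a width of 2 keeps "b" alive long enough to find "b z" at 0.36. The price is that every step scores B sequences instead of one, and the best sequence is only known once the search finishes.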
Key takeaway
For extraction and classification: temperature=0. For generation and creativity: temperature=0.7–1.0. Always set max_tokens. Use reasoning models when accuracy matters more than speed.