Skip to content

How LLMs Generate Text

LLMs don't "think of" the next word. They compute a probability distribution over the entire vocabulary and then sample from it. The sampling strategy you choose shapes everything from creativity to factual accuracy.

Learning objectives

  • Explain autoregressive generation and why it's sequential
  • Compare greedy, temperature, top-k, and top-p sampling
  • Set temperature, top_p, and max_tokens correctly for a given task
  • Explain why reasoning models (o3, o4-mini) behave differently

Autoregressive generation

LLMs generate text one token at a time, left to right. Each new token is conditioned on all previous tokens:

P(token_5 | token_1, token_2, token_3, token_4)

The model cannot "go back" and change an earlier token. This is why getting the model to reason before answering — via chain-of-thought — improves output quality: it commits earlier tokens to a reasoning path that constrains and improves later tokens.

import openai

client = openai.OpenAI()

# Each call to the API generates one "response" which internally
# generates tokens one at a time via autoregressive decoding
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "The capital of France is"}],
    max_tokens=5,
    temperature=0,  # greedy — always picks the highest-probability token
)
print(response.choices[0].message.content)
# Output: " Paris." (deterministic with temperature=0)

The logits-to-token pipeline

After the transformer's final layer, the model produces a logit vector — one raw score per vocabulary token (~200,000 scores for GPT-4o):

[final transformer layer output]
[Linear projection] → 200,000 raw logit scores
[Apply temperature] → scale logits
[Apply top-k / top-p filtering] → zero out low-probability tokens
[Softmax] → probability distribution
[Sample] → one token index
[Decode] → "Paris"

Greedy decoding

Always pick the highest-probability token. Deterministic but boring.

import torch
import torch.nn.functional as F

def greedy_decode(logits: torch.Tensor) -> int:
    return logits.argmax().item()

# The problem with greedy: it gets stuck in repetitive loops
# "The cat sat on the mat. The cat sat on the mat. The cat..."
# Use temperature=0 in APIs for near-greedy behavior

When to use: Extraction tasks, structured output, or when you need determinism in tests.


Temperature

Temperature scales the logits before softmax, controlling how peaked or flat the distribution is.

def apply_temperature(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    if temperature == 0:
        # Effectively greedy — return a one-hot distribution
        probs = torch.zeros_like(logits)
        probs[logits.argmax()] = 1.0
        return probs
    return F.softmax(logits / temperature, dim=-1)

# Demonstration
logits = torch.tensor([3.0, 1.5, 0.5, 0.2])  # 4-token vocab

for temp in [0.1, 0.5, 1.0, 1.5, 2.0]:
    probs = apply_temperature(logits, temp)
    print(f"T={temp}: {probs.tolist()}")

# T=0.1:  [0.98, 0.02, 0.00, 0.00]  ← very peaked, near-deterministic
# T=1.0:  [0.67, 0.22, 0.08, 0.03]  ← original distribution
# T=2.0:  [0.40, 0.28, 0.19, 0.14]  ← much flatter, more random

Temperature guide by task:

Task Recommended Temperature
Fact extraction, classification 0.0 – 0.2
Summarization, Q&A 0.3 – 0.5
Code generation 0.2 – 0.4
Writing, brainstorming 0.7 – 1.0
Creative fiction, poetry 1.0 – 1.2
Avoid anything above 1.5 Output degrades rapidly

Top-k sampling

Restrict sampling to only the top k tokens. Prevents the model from ever picking a very improbable token.

def top_k_filter(logits: torch.Tensor, k: int) -> torch.Tensor:
    """Zero out all logits except the top k."""
    top_k_vals, _ = torch.topk(logits, k)
    threshold = top_k_vals[-1]
    filtered = logits.clone()
    filtered[filtered < threshold] = float("-inf")
    return F.softmax(filtered, dim=-1)

logits = torch.tensor([3.0, 1.5, 0.5, 0.2, -0.5, -2.0])
probs_k10 = top_k_filter(logits, k=3)
print(probs_k10)  # Only top 3 tokens have non-zero probability

Problem with top-k: k=50 is too wide when the distribution is sharp (one obvious token), and too narrow when the distribution is flat (many reasonable tokens).


Top-p (nucleus) sampling

Instead of a fixed count k, take the smallest set of tokens whose cumulative probability exceeds p. Adapts to the distribution shape.

def top_p_filter(logits: torch.Tensor, p: float) -> torch.Tensor:
    """Keep only the top tokens whose cumulative probability >= p."""
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)

    # Remove tokens above the threshold (but keep the first one that crosses)
    sorted_indices_to_remove = cumulative_probs > p
    sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
    sorted_indices_to_remove[..., 0] = False

    indices_to_remove = sorted_indices_to_remove.scatter(
        0, sorted_indices, sorted_indices_to_remove
    )
    filtered_logits = logits.masked_fill(indices_to_remove, float("-inf"))
    return F.softmax(filtered_logits, dim=-1)

# When the model is very confident: top-p=0.9 picks just 1-2 tokens
# When the model is uncertain: top-p=0.9 picks 10-20 tokens
# This adaptive behavior is why top-p outperforms top-k in practice

In practice: Set top_p=1.0 (effectively disabled) and use only temperature, OR set temperature=1.0 and use only top-p. Combining both often causes unexpected interactions.


Using sampling parameters in the API

import openai, anthropic

openai_client = openai.OpenAI()
anthropic_client = anthropic.Anthropic()

# Creative writing — high temperature
creative_response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write the opening line of a noir novel."}],
    temperature=1.0,
    max_tokens=100,
    # top_p=1.0,  # leave at default when using temperature
)

# Data extraction — near-greedy
extraction_response = anthropic_client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=256,
    temperature=0.0,  # deterministic
    messages=[{
        "role": "user",
        "content": 'Extract the invoice number from: "Invoice #INV-2024-8821 dated March 15"',
    }],
)

print("Creative:", creative_response.choices[0].message.content)
print("Extraction:", extraction_response.content[0].text)

Stop sequences

Stop sequences tell the model to stop generating when it produces a specific string. Crucial for structured output and preventing runaway responses.

# Stop before the model starts a new "User:" turn in simulated dialogue
response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Simulate a support conversation."}],
    stop=["User:", "Human:", "\n\n---"],
    max_tokens=500,
)

# Stop after JSON object closes
response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": 'Return JSON: {"name": "Alice", "age":'}],
    stop=["}"],
    max_tokens=20,
)
# Output will be: ' 30' — the model stops before producing the closing brace

max_tokens is not optional

Always set max_tokens. Without it, the model can generate until the context window is full — costing you money and delivering a garbage response. For most tasks: 256–2048. For summaries of long docs: 1024–4096.


Reasoning models: a different paradigm

OpenAI's o-series (o1, o3, o4-mini) and Anthropic's extended thinking mode work differently. They generate a hidden chain-of-thought before producing the visible answer.

# o4-mini uses a "reasoning effort" parameter instead of temperature
response = openai_client.chat.completions.create(
    model="o4-mini",
    messages=[{"role": "user", "content": "What is 17 × 23 + sqrt(289)?"}],
    reasoning_effort="high",  # "low", "medium", or "high"
    max_completion_tokens=2000,  # includes reasoning tokens
)
# Note: temperature and top_p are NOT used with o-series models
print(response.choices[0].message.content)

# Anthropic extended thinking
response = anthropic_client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000,  # how many tokens the model can spend reasoning
    },
    messages=[{"role": "user", "content": "Solve: If a snail crawls 3cm/hour..."}],
)
for block in response.content:
    if block.type == "thinking":
        print("REASONING:", block.thinking[:200], "...")
    else:
        print("ANSWER:", block.text)

When to use reasoning models

Use o3/o4-mini or extended thinking for: multi-step math, complex code generation, tasks where being wrong is expensive. Don't use them for: simple Q&A, quick classification, tasks that need low latency. They're 5–20× slower and more expensive.


Beam search (not used in modern LLM APIs)

Beam search maintains the top B candidate sequences simultaneously. It was standard in translation models but is not used in GPT-4o, Claude, or other chat models:

  • Produces repetitive, generic output
  • Computationally expensive at scale
  • Incompatible with streaming

Beam search still appears in specialized models (like Whisper for speech-to-text). For chat and generation, sampling always wins.


Key takeaway

For extraction and classification: temperature=0. For generation and creativity: temperature=0.7–1.0. Always set max_tokens. Use reasoning models when accuracy matters more than speed.

03-context-windows | 05-practice-exercises