Context Windows
The context window is the model's working memory: everything it can "see" at once. Get this wrong and your RAG pipeline's hallucination rate spikes. Get it right and you can process entire codebases or legal documents in a single call.
Learning objectives
- Define context window and explain what goes inside it
- Compare current model context limits and their practical implications
- Identify the "lost in the middle" problem and when it matters
- Design prompts and retrieval pipelines that work within context limits
What is a context window?
The context window is the maximum number of tokens a model can attend to in a single request: input tokens plus output tokens combined.
┌──────────────────────────────────────────────────────┐
│                    CONTEXT WINDOW                    │
│                                                      │
│  [System Prompt] [Chat History] [Retrieved Docs]     │
│  [User Message]   ←── input tokens ──────────────►   │
│                                                      │
│  [Model Response] ←── output tokens ─────────────►   │
└──────────────────────────────────────────────────────┘
Everything in the window is "visible" to every attention head in every layer. Nothing outside the window exists for the model. It has no implicit memory between API calls.
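Because input and output share one window, a request only fits if the prompt plus the reserved output stays within the limit. A minimal sketch of that check follows; `fits_in_window` and `max_input_budget` are hypothetical helper names, not part of any SDK, and the numbers are only illustrative.

```python
def fits_in_window(input_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """Return True if the prompt plus the reserved output fits in the window."""
    if min(input_tokens, max_output_tokens, context_window) < 0:
        raise ValueError("token counts must be non-negative")
    return input_tokens + max_output_tokens <= context_window


def max_input_budget(context_window: int, max_output_tokens: int) -> int:
    """Tokens left for system prompt, history, retrieved docs and the user message."""
    return max(context_window - max_output_tokens, 0)


# Illustrative: a 128,000-token window with 16,384 tokens reserved for output.
print(fits_in_window(100_000, 16_384, 128_000))  # True
print(fits_in_window(120_000, 16_384, 128_000))  # False
print(max_input_budget(128_000, 16_384))         # 111616
```

Running a check like this before every call turns a provider-side "context length exceeded" error into something the application can handle, for example by trimming history or dropping low-ranked chunks.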
2025 context window landscape
| Model | Context Window | Max Output | Notes |
|---|---|---|---|
| GPT-4o | 128,000 tokens | 16,384 | ~100 pages of text |
| GPT-4o-mini | 128,000 tokens | 16,384 | Same window, lower cost |
| o3 | 200,000 tokens | 100,000 | Chain-of-thought output can be large |
| o4-mini | 200,000 tokens | 100,000 | |
| Claude Sonnet 4.6 | 1,000,000 tokens | 64,000 | ~750 pages of text |
| Claude Opus 4.7 | 1,000,000 tokens | 64,000 | |
| Gemini 3 Pro | 1,000,000 tokens | 65,000 | |
| Llama 3.3 70B | 128,000 tokens | 8,192 | Open-weight |
| Mistral Large | 128,000 tokens | — | |
As of May 2026. The context window race has largely settled — 1M tokens is the current ceiling for frontier models.
Tokens → pages rough conversion
1,000 tokens ≈ 750 words ≈ 0.75 pages of dense English text (assuming about 1,000 words per page; sparser layouts give higher page counts). So:
- 128K tokens ≈ 100 pages
- 1M tokens ≈ 750 pages (a thick novel, or a large codebase)
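The conversion is simple arithmetic, but the page count depends heavily on how many words you assume fit on a page. The sketch below makes that assumption an explicit parameter; `tokens_to_pages` and `WORDS_PER_TOKEN` are hypothetical names, and 0.75 words per token is only a rule of thumb for English.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb for English: 1,000 tokens is about 750 words


def tokens_to_words(tokens: int) -> float:
    """Approximate word count for a given token count."""
    return tokens * WORDS_PER_TOKEN


def tokens_to_pages(tokens: int, words_per_page: int) -> float:
    """Approximate page count; words_per_page is an explicit assumption."""
    if words_per_page <= 0:
        raise ValueError("words_per_page must be positive")
    return tokens_to_words(tokens) / words_per_page


for wpp in (500, 1000):
    print(wpp, round(tokens_to_pages(128_000, wpp)), round(tokens_to_pages(1_000_000, wpp)))
# 500 192 1500
# 1000 96 750
```

The ~100-page and ~750-page figures correspond to dense pages of about 1,000 words; at 500 words per page the same windows hold roughly twice as many pages.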
What goes inside the context window
In a typical RAG pipeline the context fills up fast:
import tiktoken


def audit_context_usage(
    system_prompt: str,
    chat_history: list[dict],
    retrieved_chunks: list[str],
    user_query: str,
    model: str = "gpt-4o",
    context_limit: int = 128_000,  # GPT-4o window; pass your own model's limit
) -> dict:
    """Break down token usage across all context components."""
    # Use the model's own encoding when tiktoken knows it; otherwise fall back
    # to o200k_base (the GPT-4o encoding). Counts for other providers' models
    # are only approximate.
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        enc = tiktoken.get_encoding("o200k_base")

    def count(text: str) -> int:
        return len(enc.encode(text))

    # Content only: per-message formatting overhead is not counted here.
    history_text = " ".join(m["content"] for m in chat_history)
    chunks_text = "\n\n".join(retrieved_chunks)
    breakdown = {
        "system_prompt": count(system_prompt),
        "chat_history": count(history_text),
        "retrieved_chunks": count(chunks_text),
        "user_query": count(user_query),
    }
    # Sum the components before adding derived keys to avoid double counting.
    breakdown["total_input"] = sum(breakdown.values())
    breakdown["remaining"] = context_limit - breakdown["total_input"]
    return breakdown


# Example
result = audit_context_usage(
    system_prompt="You are a helpful assistant that answers questions from documents.",
    chat_history=[
        {"role": "user", "content": "What's the refund policy?"},
        {"role": "assistant", "content": "According to the document..."},
    ],
    retrieved_chunks=["Section 3.2: Refunds are processed within 14 days..."] * 5,
    user_query="Can I get a refund after 30 days?",
)
for key, val in result.items():
    print(f"{key:20}: {val:6,} tokens")
The "lost in the middle" problem
Longer context sounds like a free lunch — just stuff everything in. But research shows model performance degrades on information placed in the middle of a long context.
Accuracy retrieving a fact from context, by position (illustrative numbers):
Position: Beginning  ████████████ 95%
Position: Middle     ██████       65%
Position: End        ████████████ 92%
This is the "lost in the middle" effect (Liu et al., 2023). It can persist, to varying degrees, even in models with 1M-token windows.
Practical implication for RAG:
def reorder_chunks_for_attention(chunks: list[str], query: str) -> list[str]:
    """
    Put the two most relevant chunks at the beginning and end.
    Leave less relevant chunks in the middle.
    This is sometimes called 'lost in the middle' mitigation.
    """
    # Assumes chunks are already ranked by relevance score (highest first).
    # `query` is unused here; it is kept for callers that rerank first.
    if len(chunks) <= 2:
        return chunks
    # Most relevant → first, second most → last, rest stay in rank order between them
    reordered = [chunks[0]] + chunks[2:] + [chunks[1]]
    return reordered
# In production: combine this with a reranker
# See: Week-02/Day-01-Part-2-Advanced-RAG
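The function above pins only the top two chunks, so the third-ranked chunk still lands near the front and the lowest-ranked chunk sits just before the end. A common variation alternates ranked chunks between the front and the back so that relevance falls off towards the middle from both sides. This is a sketch of that idea, with `interleave_for_attention` as a hypothetical name:

```python
def interleave_for_attention(ranked_chunks: list[str]) -> list[str]:
    """Alternate ranked chunks between the front and back of the context.

    Input is sorted most relevant first. Ranks 1, 3, 5, ... fill the front in
    order; ranks 2, 4, 6, ... fill the back in reverse, so the weakest chunks
    end up in the middle.
    """
    front = ranked_chunks[0::2]  # ranks 1, 3, 5, ...
    back = ranked_chunks[1::2]   # ranks 2, 4, 6, ...
    return front + back[::-1]


ranked = ["r1", "r2", "r3", "r4", "r5"]
print(interleave_for_attention(ranked))  # ['r1', 'r3', 'r5', 'r4', 'r2']
```

Whether this ordering helps for a given model is an empirical question; it is worth measuring on your own retrieval set rather than assuming.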
Big context ≠ good retrieval
Sending 200 chunks to a 1M-token model is not better than sending the 5 most relevant chunks. The model's ability to precisely locate and use information degrades with context length. Use retrieval to narrow the context down; a large window is not a reason to skip retrieval.
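One way to act on this is to cap both the number of chunks and their combined token cost before building the prompt. The sketch below assumes each chunk already carries a relevance score from your retriever. `select_chunks` and `rough_count` are hypothetical helpers, and the whitespace counter is a stand-in for a real tokenizer.

```python
from typing import Callable


def select_chunks(
    scored_chunks: list[tuple[float, str]],
    max_chunks: int,
    token_budget: int,
    count_tokens: Callable[[str], int],
) -> list[str]:
    """Keep the highest-scoring chunks that fit within both limits."""
    selected: list[str] = []
    used = 0
    for _score, chunk in sorted(scored_chunks, key=lambda pair: pair[0], reverse=True):
        if len(selected) == max_chunks:
            break
        cost = count_tokens(chunk)
        if used + cost > token_budget:
            continue  # too large for what is left; a smaller chunk may still fit
        selected.append(chunk)
        used += cost
    return selected


def rough_count(text: str) -> int:
    """Stand-in tokenizer: counts whitespace-separated words."""
    return len(text.split())


candidates = [
    (0.91, "refunds are processed within 14 days"),
    (0.40, "shipping takes five business days on average"),
    (0.87, "refund requests after 30 days need manager approval"),
    (0.15, "our office is closed on public holidays"),
]
top = select_chunks(candidates, max_chunks=2, token_budget=20, count_tokens=rough_count)
print(top)
```

In a real pipeline `count_tokens` would wrap the model's tokenizer, and the scores would come from a reranker rather than raw vector similarity.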
Context window vs. KV cache
The KV cache is a performance optimization, not a conceptual extension of the context window.
During inference, attention for each new token needs the Key and Value vectors of every previous token. Rather than recomputing them at every generation step, the inference engine caches them in GPU memory.
Without KV cache:
Token 1000 generation: recompute K,V for tokens 1–999 → slow
With KV cache:
Token 1000 generation: read cached K,V for tokens 1–999 → fast
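Memory cost: 2 (K + V) × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_value
(num_kv_heads × head_dim equals d_model only when the model does not use grouped-query attention)
For Llama 3 70B (80 layers, 8 KV heads, head_dim 128, fp16), 128K context: ~42 GB per sequence just for the KV cache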
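The estimate can be reproduced with a few lines of arithmetic. This sketch assumes the published Llama 3 70B configuration (80 layers, 8 key/value heads under grouped-query attention, head dimension 128) and 2-byte fp16 values. `kv_cache_bytes` is a hypothetical helper, and real inference engines add their own overhead on top.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_value: int = 2,  # fp16 / bf16
    batch_size: int = 1,
) -> int:
    """Bytes needed to cache keys and values for every layer and token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * seq_len * batch_size


# Assumed Llama 3 70B config: 80 layers, 8 KV heads, head_dim 128.
gqa = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=128_000)
# Same model if every one of the 64 query heads had its own K/V (no GQA).
mha = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128, seq_len=128_000)

print(f"with GQA:    {gqa / 1e9:.1f} GB")  # 41.9 GB
print(f"without GQA: {mha / 1e9:.1f} GB")  # 335.5 GB
```

The 8x gap between the two results is why grouped-query attention matters for long contexts, and why a formula based on `d_model` alone overestimates the cache for models that use it.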
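Prompt caching (Anthropic and OpenAI)
Both providers cache prompt prefixes across API calls. If 1,000 calls share the same long system prompt, only the first call pays to process and cache it in full; later calls that hit the cache are billed a reduced rate for those tokens. Discounts vary by provider and model, and Anthropic prices cache reads at about 10% of the normal input rate (~90% cheaper). Prefixes shorter than the provider's minimum cacheable length are not cached, and cache entries expire after a period of inactivity. Design your system prompts to be stable prefixes.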
# Anthropic prompt caching
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a helpful assistant.",
        },
        {
            "type": "text",
            # Placeholder: substitute the real document (e.g. ~10,000 tokens).
            "text": "LARGE_STABLE_DOCUMENT_HERE",
            # Marks the end of the prefix to cache (everything up to this block).
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "Summarize section 3."}],
)
print(f"Cache creation tokens: {response.usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
# Second call with the same prefix: cache_read_input_tokens ≈ 10,000,
# and those tokens are billed at the cache-read rate (~90% cheaper).
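To see what caching is worth for a given workload, compare the input cost with and without it. The sketch below is a back-of-the-envelope model, not a billing calculator. The 1.25x cache-write and 0.10x cache-read multipliers are assumptions modelled on Anthropic's published pricing at the time of writing, the $3 per million tokens price is illustrative, and the model assumes every call after the first hits the cache.

```python
def cost_without_cache(prefix_tokens: int, suffix_tokens: int, calls: int, price_per_mtok: float) -> float:
    """Input cost in dollars when the full prompt is billed on every call."""
    return calls * (prefix_tokens + suffix_tokens) * price_per_mtok / 1e6


def cost_with_cache(
    prefix_tokens: int,
    suffix_tokens: int,
    calls: int,
    price_per_mtok: float,
    write_multiplier: float = 1.25,  # assumed surcharge for the first (cache-writing) call
    read_multiplier: float = 0.10,   # assumed discount rate for cache hits
) -> float:
    """Input cost in dollars when the prefix is written once and then read from cache."""
    if calls < 1:
        return 0.0
    billed_tokens = (
        prefix_tokens * write_multiplier                   # first call writes the cache
        + (calls - 1) * prefix_tokens * read_multiplier    # later calls read it
        + calls * suffix_tokens                            # the changing part is never cached
    )
    return billed_tokens * price_per_mtok / 1e6


plain = cost_without_cache(10_000, 50, 1_000, price_per_mtok=3.0)
cached = cost_with_cache(10_000, 50, 1_000, price_per_mtok=3.0)
print(f"without cache: ${plain:.2f}")  # $30.15
print(f"with cache:    ${cached:.2f}")  # $3.18
print(f"saving:        {1 - cached / plain:.0%}")
```

Real savings will be lower if the cache expires between calls or if the prefix changes, which is the practical reason to keep the prefix stable.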
Practical guidelines
| Scenario | Recommendation |
|---|---|
| RAG with many chunks | Keep total retrieved context < 20% of context window. Leave room for reasoning. |
| Long document Q&A | Use Claude Sonnet 4.6 (1M context) for docs < 700 pages. Chunk + summarize for larger. |
| Multi-turn chat | Truncate history after N turns, or summarize old turns. Track tokens actively. |
| Code generation | Reserve 4,096+ output tokens. Code responses are long. |
| Reasoning models (o3, o4-mini) | These spend part of the output token budget on chain-of-thought. Set the output limit (`max_completion_tokens` in OpenAI's Chat Completions API) accordingly. |
import tiktoken


def sliding_window_history(
    messages: list[dict],
    max_history_tokens: int = 4000,
    model: str = "gpt-4o",
) -> list[dict]:
    """Trim chat history to stay within a token budget."""
    # Use the model's own encoding when tiktoken knows it; otherwise fall back
    # to o200k_base (the GPT-4o encoding).
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        enc = tiktoken.get_encoding("o200k_base")
    total = 0
    kept = []
    # Walk from newest to oldest so the most recent messages survive.
    for msg in reversed(messages):
        # +4 is a rough allowance for per-message formatting overhead.
        msg_tokens = len(enc.encode(msg["content"])) + 4
        if total + msg_tokens > max_history_tokens:
            break
        kept.append(msg)
        total += msg_tokens
    kept.reverse()  # restore chronological order
    return kept


# Always keep the most recent messages, drop the oldest
messages = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"Message {i} " * 50}
    for i in range(20)
]
trimmed = sliding_window_history(messages, max_history_tokens=2000)
print(f"Kept {len(trimmed)} of {len(messages)} messages")
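The guidelines in the table can be folded into a single budget plan computed once per model. This sketch applies two of them: reserving output tokens and capping retrieved context at 20% of the window. `ContextBudget` and `plan_budget` are hypothetical names, and the defaults are the table's rules of thumb rather than provider requirements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextBudget:
    context_window: int
    reserved_output: int
    retrieved_cap: int
    prompt_and_history: int  # what is left for system prompt, history and user message


def plan_budget(
    context_window: int,
    reserved_output: int = 4096,
    retrieved_percent: int = 20,
) -> ContextBudget:
    """Split a context window into output, retrieval and prompt/history budgets."""
    if not 0 <= retrieved_percent <= 100:
        raise ValueError("retrieved_percent must be between 0 and 100")
    retrieved_cap = context_window * retrieved_percent // 100
    remaining = context_window - reserved_output - retrieved_cap
    if remaining < 0:
        raise ValueError("window too small for the requested reservations")
    return ContextBudget(context_window, reserved_output, retrieved_cap, remaining)


budget = plan_budget(128_000)
print(budget.retrieved_cap)       # 25600
print(budget.prompt_and_history)  # 98304
```

The `prompt_and_history` figure is a natural value to pass as the history budget when trimming chat turns, after subtracting the system prompt's own token count.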
Key takeaway
The context window is your most precious resource. Monitor it, budget it, and always know how many tokens your system prompt, retrieved chunks, and chat history consume before the user even types a word.