Chunking Strategies¶
Chunking is the most underrated part of RAG. The right chunk size and strategy determines whether your retrieval system finds the right information. The wrong strategy makes even perfect embeddings useless.
Learning objectives¶
- Implement fixed-size, sentence, recursive, semantic, and hierarchical chunking
- Measure chunk quality with a token counter
- Choose a strategy based on document type and query pattern
- Handle edge cases: very short documents, tables, code blocks
Fixed-size chunking¶
The simplest approach: split by character or token count with overlap.
import os
import re
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
def count_tokens(text: str) -> int:
return len(enc.encode(text))
def fixed_size_chunk(
text: str,
chunk_size: int = 512, # tokens
overlap: int = 64, # tokens
source: str = "doc"
) -> list[dict]:
tokens = enc.encode(text)
chunks = []
i = 0
while i < len(tokens):
chunk_tokens = tokens[i:i + chunk_size]
chunk_text = enc.decode(chunk_tokens)
chunks.append({
"id": f"{source}-{len(chunks)}",
"text": chunk_text.strip(),
"tokens": len(chunk_tokens),
"source": source,
"strategy": "fixed"
})
i += chunk_size - overlap
return chunks
sample = """
Artificial intelligence is transforming the way businesses operate. Machine learning,
a core branch of AI, enables systems to learn and improve from experience without being
explicitly programmed. Deep learning, a subset of machine learning, uses neural networks
with many layers to model complex patterns in data.
Natural language processing (NLP) allows computers to understand, interpret, and generate
human language. Applications include sentiment analysis, machine translation, and chatbots.
Large language models like GPT-4o and Claude have pushed NLP capabilities dramatically forward.
Computer vision enables machines to interpret visual information from the world. Convolutional
neural networks (CNNs) excel at image classification, object detection, and segmentation.
Diffusion models now power state-of-the-art image generation.
""".strip()
chunks = fixed_size_chunk(sample, chunk_size=80, overlap=10)
for c in chunks:
print(f"Chunk {c['id']}: {c['tokens']} tokens — {c['text'][:60]}...")
Best for: Generic documents, quick prototyping, when you don't know your document structure. Avoid for: Documents with clear semantic boundaries (chapters, functions, paragraphs).
Sentence-level chunking¶
Respects sentence boundaries — no chunk ends mid-sentence.
import nltk
nltk.download("punkt_tab", quiet=True)
from nltk.tokenize import sent_tokenize
def sentence_chunk(
text: str,
max_tokens: int = 256,
overlap_sentences: int = 1,
source: str = "doc"
) -> list[dict]:
sentences = sent_tokenize(text)
chunks = []
current_sentences = []
current_tokens = 0
for i, sent in enumerate(sentences):
sent_tokens = count_tokens(sent)
if current_tokens + sent_tokens > max_tokens and current_sentences:
chunk_text = " ".join(current_sentences)
chunks.append({
"id": f"{source}-{len(chunks)}",
"text": chunk_text.strip(),
"tokens": count_tokens(chunk_text),
"source": source,
"strategy": "sentence"
})
# Keep overlap sentences
current_sentences = current_sentences[-overlap_sentences:]
current_tokens = sum(count_tokens(s) for s in current_sentences)
current_sentences.append(sent)
current_tokens += sent_tokens
if current_sentences:
chunk_text = " ".join(current_sentences)
chunks.append({
"id": f"{source}-{len(chunks)}",
"text": chunk_text.strip(),
"tokens": count_tokens(chunk_text),
"source": source,
"strategy": "sentence"
})
return chunks
chunks = sentence_chunk(sample, max_tokens=80, overlap_sentences=1)
for c in chunks:
print(f"Chunk {c['id']}: {c['tokens']} tokens — {c['text'][:70]}...")
Best for: Articles, documentation, prose documents. Avoid for: Code, structured data, documents with very long sentences.
Recursive character chunking¶
Tries multiple separators in order, falling back to character splitting only when needed. This is what LangChain's RecursiveCharacterTextSplitter implements.
def recursive_chunk(
text: str,
chunk_size: int = 500, # characters
overlap: int = 50,
separators: list[str] | None = None,
source: str = "doc"
) -> list[dict]:
if separators is None:
separators = ["\n\n", "\n", ". ", " ", ""]
def _split(text: str, sep_index: int) -> list[str]:
if sep_index >= len(separators):
return [text]
separator = separators[sep_index]
parts = text.split(separator) if separator else list(text)
parts = [p for p in parts if p.strip()]
chunks = []
current = ""
for part in parts:
if len(current) + len(separator) + len(part) <= chunk_size:
current = (current + separator + part).strip() if current else part
else:
if current:
chunks.append(current)
if len(part) > chunk_size:
# Part itself is too long — recurse with next separator
chunks.extend(_split(part, sep_index + 1))
current = ""
else:
current = part
if current:
chunks.append(current)
return chunks
raw_chunks = _split(text, 0)
result = []
for i, chunk_text in enumerate(raw_chunks):
if chunk_text.strip():
result.append({
"id": f"{source}-{i}",
"text": chunk_text.strip(),
"tokens": count_tokens(chunk_text),
"source": source,
"strategy": "recursive"
})
return result
chunks = recursive_chunk(sample, chunk_size=300, overlap=30)
for c in chunks:
print(f"Chunk {c['id']}: {c['tokens']} tokens — {c['text'][:60]}...")
Best for: Mixed documents, markdown, code + prose. Avoid for: Documents where you need full semantic preservation.
Semantic chunking¶
Groups sentences together based on embedding similarity — a new paragraph starts when the topic shifts significantly.
from openai import OpenAI
import numpy as np
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
def cosine_sim(a, b):
return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
def semantic_chunk(
text: str,
breakpoint_threshold: float = 0.3, # drop in similarity triggers new chunk
min_chunk_tokens: int = 50,
source: str = "doc"
) -> list[dict]:
sentences = sent_tokenize(text)
if len(sentences) < 2:
return [{"id": f"{source}-0", "text": text, "tokens": count_tokens(text), "source": source, "strategy": "semantic"}]
# Embed each sentence
resp = client.embeddings.create(model="text-embedding-3-small", input=sentences)
embeddings = [item.embedding for item in resp.data]
# Compute similarity between consecutive sentences
similarities = [cosine_sim(embeddings[i], embeddings[i+1]) for i in range(len(embeddings)-1)]
# Find breakpoints where similarity drops sharply
avg_sim = np.mean(similarities)
breakpoints = {i+1 for i, sim in enumerate(similarities) if sim < avg_sim - breakpoint_threshold}
# Build chunks
chunks = []
current_sentences = []
current_tokens = 0
for i, sent in enumerate(sentences):
if i in breakpoints and current_tokens >= min_chunk_tokens and current_sentences:
chunks.append({
"id": f"{source}-{len(chunks)}",
"text": " ".join(current_sentences),
"tokens": current_tokens,
"source": source,
"strategy": "semantic"
})
current_sentences = []
current_tokens = 0
current_sentences.append(sent)
current_tokens += count_tokens(sent)
if current_sentences:
chunks.append({
"id": f"{source}-{len(chunks)}",
"text": " ".join(current_sentences),
"tokens": current_tokens,
"source": source,
"strategy": "semantic"
})
return chunks
# chunks = semantic_chunk(sample)
# for c in chunks:
# print(f"Chunk {c['id']}: {c['tokens']} tokens — {c['text'][:70]}...")
Best for: Long documents with clear topic transitions (research papers, book chapters). Cost: Requires embedding every sentence (more API calls). Use only when chunk quality is critical.
Hierarchical (parent-child) chunking¶
Store large parent chunks for context and small child chunks for retrieval. Retrieve by child, return the parent.
def hierarchical_chunk(
text: str,
parent_size: int = 1000,
child_size: int = 200,
child_overlap: int = 20,
source: str = "doc"
) -> tuple[list[dict], list[dict]]:
parents = fixed_size_chunk(text, chunk_size=parent_size, overlap=0, source=source)
children = []
for parent in parents:
child_chunks = fixed_size_chunk(
parent["text"],
chunk_size=child_size,
overlap=child_overlap,
source=f"{parent['id']}"
)
for child in child_chunks:
child["parent_id"] = parent["id"]
child["parent_text"] = parent["text"]
children.extend(child_chunks)
return parents, children
parents, children = hierarchical_chunk(sample, parent_size=150, child_size=50, child_overlap=5)
print(f"Parents: {len(parents)}, Children: {len(children)}")
for child in children[:3]:
print(f" Child {child['id']}: '{child['text'][:50]}...' → Parent: {child['parent_id']}")
# At search time: embed + index children, retrieve children, return parent_text in prompt
Best for: When you want fine-grained retrieval but need enough context for generation. How it works: Embed and search the small child chunks (high precision). When a child is retrieved, include the full parent chunk in the generation context (more context for the LLM).
Chunking strategy comparison¶
| Strategy | Chunk boundaries | Semantic coherence | Cost | Best for |
|---|---|---|---|---|
| Fixed-size | Token count | Low | Very low | Prototyping |
| Sentence | Sentence end | Medium | Low | Prose documents |
| Recursive | Hierarchical separators | Medium-high | Low | Mixed content, markdown |
| Semantic | Topic shifts | High | High (API calls) | Research papers, books |
| Hierarchical | Multi-level | Highest | Medium | Long documents needing full-context generation |
Start with recursive chunking
For most documents, recursive chunking with chunk_size=512 and overlap=64 tokens is the right starting point. It's cheap, fast, and respects natural document structure. Only switch to semantic chunking when you measure poor retrieval quality on recursive chunks.
Handling special content¶
def chunk_with_metadata(
text: str,
source: str,
doc_type: str = "text"
) -> list[dict]:
# Code blocks: chunk by function/class, not by token count
if doc_type == "code":
# Split on function/class boundaries
blocks = re.split(r'\n(?=def |class )', text)
return [
{"id": f"{source}-{i}", "text": block.strip(), "tokens": count_tokens(block),
"source": source, "doc_type": "code", "strategy": "code_boundary"}
for i, block in enumerate(blocks) if block.strip()
]
# Markdown: respect heading boundaries
if doc_type == "markdown":
sections = re.split(r'\n(?=#{1,3} )', text)
return [
{"id": f"{source}-{i}", "text": section.strip(), "tokens": count_tokens(section),
"source": source, "doc_type": "markdown", "strategy": "heading"}
for i, section in enumerate(sections) if section.strip()
]
# Default: recursive character splitting
return recursive_chunk(text, source=source)