Text Splitters & Chunking Strategies¶
1. Why does this topic exist?¶
We have clean Documents from the loader. Why not just embed them whole?
Three hard limits forbid it:
- Embedding models cap input length: typically 8K tokens (OpenAI text-embedding-3-small), 32K at the largest. A 200-page PDF is ~80K tokens. It physically won't fit.
- One vector can't represent 50 pages. Embedding compresses meaning into ~1500 numbers. If you cram 50 pages' worth of meaning in there, every concept blurs into an average. Retrieval becomes "give me the doc most generally similar", which is useless.
- Retrieval granularity = answer quality. When a user asks about a specific topic, you want the specific paragraph, not the whole book. Without chunking, you can only retrieve "books that mention X", not "the paragraph about X".
Industry pain example: A startup embedded entire research papers as single vectors. Every query returned the same 3 papers (the broadest ones, loosely relevant to everything). After chunking into 800-char pieces, the same queries returned specific equations, figures, and paragraphs. Recall@5 went from 0.3 to 0.91 overnight.
Why the old approach failed:
| Old approach | Problem |
|---|---|
| One vector per document | Loses granularity |
| Naive split every 1000 chars | Cuts mid-sentence, mid-equation |
| Manual section detection | Bespoke per format, brittle |
Chunking solves the granularity problem; smart chunking solves the boundary problem (don't cut in the middle of meaning).
2. What is it?¶
Simple explanation¶
A text splitter is a knife that knows where to cut. It breaks long documents into smaller, overlapping pieces — each small enough to embed, large enough to carry meaning.
Technical explanation¶
A Text Splitter is a class that takes a Document (or list) and returns a list of smaller Documents ("chunks"), governed by two parameters:
- chunk_size: max characters/tokens per chunk
- chunk_overlap: characters shared between consecutive chunks
flowchart LR
A[Document<br/>10000 tokens] --> B[Splitter<br/>size=500 overlap=100]
B --> C[Chunk 1: tokens 0-500]
B --> D[Chunk 2: tokens 400-900]
B --> E[Chunk 3: tokens 800-1300]
B --> F[...]
The chunks preserve the parent document's metadata and add their own (chunk index, etc.).
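The windowing in the diagram above can be reproduced in a few lines of plain Python. This is a minimal sketch rather than any library's implementation: `chunk_spans` is a hypothetical helper, and it assumes each chunk starts `chunk_size - chunk_overlap` units after the previous one.

```python
def chunk_spans(total_len: int, chunk_size: int, chunk_overlap: int) -> list[tuple[int, int]]:
    """Return (start, end) offsets of consecutive chunks over a sequence of total_len units."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - chunk_overlap  # distance between consecutive chunk starts
    spans = []
    start = 0
    while start < total_len:
        end = min(start + chunk_size, total_len)
        spans.append((start, end))
        if end == total_len:
            break  # the last chunk reached the end of the document
        start += stride
    return spans


spans = chunk_spans(total_len=10_000, chunk_size=500, chunk_overlap=100)
print(spans[:3])  # [(0, 500), (400, 900), (800, 1300)]
```

With size 500 and overlap 100 the stride is 400, which is why the second chunk starts at token 400 rather than 500.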
Industry definition¶
The term "chunking" pre-dates RAG: it comes from information retrieval, going back to the 1970s. In the RAG era (2023+), eight strategies have emerged, each solving a different failure mode.
Mental model¶
Imagine slicing a long article into note cards. You want each card:
- Small enough to study quickly.
- Large enough to be self-contained (a single sentence isn't a card).
- Overlapping so an idea straddling two cards lives on both.
- Boundary-aware — never cut in the middle of a sentence or table.
That's the chunking job.
Analogy¶
A skilled chef cutting vegetables: uniform sizes (for even cooking), at natural joints (not through the center of a carrot), with enough overlap (small extras) to ensure nothing's wasted.
3. How does it work?¶
The two knobs every splitter has¶
| Knob | What it controls | Typical |
|---|---|---|
| chunk_size | Max chars/tokens per chunk | 500-1500 |
| chunk_overlap | Shared chars between adjacent chunks | 10-20% of chunk_size |
Why overlap matters¶
flowchart LR
A[Chunk A: ...the mitochondria is the powerhouse of] --> B[Chunk B: powerhouse of the cell. ATP synthesis...]
Without overlap, "powerhouse of the cell" gets split at "of" → neither chunk has the full phrase → retrieval misses queries about "powerhouse of the cell."
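The effect is easy to check on the example phrase. The sketch below uses a hypothetical `fixed_windows` helper and toy sizes (30-character chunks, 10-character overlap) chosen only so the boundary lands inside the phrase; it assumes the overlap is smaller than the chunk size.

```python
def fixed_windows(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut text into character windows; consecutive windows share chunk_overlap characters."""
    stride = chunk_size - chunk_overlap  # assumes chunk_overlap < chunk_size
    return [text[i:i + chunk_size] for i in range(0, len(text), stride)]


text = "the mitochondria is the powerhouse of the cell"
phrase = "powerhouse of the cell"

no_overlap = fixed_windows(text, chunk_size=30)
with_overlap = fixed_windows(text, chunk_size=30, chunk_overlap=10)

print(any(phrase in chunk for chunk in no_overlap))    # False
print(any(phrase in chunk for chunk in with_overlap))  # True
```

Without overlap the phrase is cut across two chunks; with overlap the second window starts early enough to contain it whole.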
Token-vs-character¶
Most splitters count characters by default. For exact embedder-budget control, use TokenTextSplitter:
from langchain_text_splitters import TokenTextSplitter
splitter = TokenTextSplitter(chunk_size=300, chunk_overlap=50)
Rule of thumb: ~4 chars ≈ 1 token (English).
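When a real tokenizer is not at hand, the rule of thumb can serve as a quick budget check. This is a rough sketch only: `estimate_tokens` and `fits_budget` are hypothetical helpers, and the 4-characters-per-token ratio is the English-prose heuristic stated above, not a tokenizer.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (heuristic, not a tokenizer)."""
    return math.ceil(len(text) / chars_per_token)


def fits_budget(text: str, max_tokens: int) -> bool:
    """Compare the estimated token count against an embedder's input limit."""
    return estimate_tokens(text) <= max_tokens


chunk = "x" * 1000              # a 1000-character chunk
print(estimate_tokens(chunk))   # 250
print(fits_budget(chunk, 300))  # True
```

For code, non-English text, or tight limits, count with the embedder's actual tokenizer instead.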
The 8 chunking strategies¶
We'll cover eight distinct strategies, from simplest to most sophisticated. Each addresses a specific failure of the simpler ones.
Strategy 1: Fixed-size chunking¶
The idea: Split every N characters. Period.
from langchain_text_splitters import CharacterTextSplitter
splitter = CharacterTextSplitter(
separator="", # no special separator
chunk_size=500,
chunk_overlap=100,
)
chunks = splitter.split_documents(docs)
Diagram:
flowchart LR
A[Text: The quick brown fox jumps over the lazy dog. Mitochondria is...] --> B[Chunk 0-500]
A --> C[Chunk 400-900]
A --> D[Chunk 800-1300]
Pros:
- Trivial implementation.
- Predictable size = predictable cost.
Cons:
- Cuts mid-sentence, mid-word, mid-equation.
- Destroys semantic boundaries.
When to use: rarely; reserve it for synthetic or uniform data (logs, structured records).
Failure case: if a chunk boundary falls inside "the temperature was 98.6°F", you get "the temperature was 98." and "6°F". Neither fragment carries the measurement, so both embed poorly.
Strategy 2: Recursive chunking — the default¶
The idea: Try to split on natural separators in priority order — paragraph → line → sentence → word → character. Falls back to the next as needed.
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=500,
chunk_overlap=100,
separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_documents(docs)
Diagram:
flowchart TD
A[Long text] --> B{Fits in chunk_size}
B -->|yes| KEEP[Keep as one chunk]
B -->|no| C[Split on paragraph \\n\\n]
C --> D{All pieces fit}
D -->|yes| OUT1[Done]
D -->|no| E[Split bigger pieces on line \\n]
E --> F{All fit}
F -->|yes| OUT2[Done]
F -->|no| G[Split on sentence . ]
G --> H[... recurse to word, then char]
Pros:
- Respects natural boundaries when possible.
- Falls back gracefully.
- The right default for general prose.
Cons:
- Doesn't understand semantics (just looks at characters).
- May still split related ideas.
When to use: 90% of use cases. Articles, manuals, docs, books.
Production tip: Always use this for first-pass general text. Customize separators only for special formats.
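To make the fallback order concrete, here is a simplified pure-Python sketch of the recursive idea. It is not LangChain's implementation: `recursive_split` is a hypothetical function, and it omits overlap to keep the control flow visible.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split on the first separator; recurse with finer separators on pieces still too big."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate      # keep merging small neighbours
            continue
        if current:
            chunks.append(current)   # flush what has been merged so far
        if len(part) > chunk_size:
            # This piece alone is too big: retry it with the next, finer separator.
            chunks.extend(recursive_split(part, chunk_size, finer))
            current = ""
        else:
            current = part
    if current:
        chunks.append(current)
    return chunks


text = "alpha beta\n\ngamma delta epsilon zeta eta theta"
chunks = recursive_split(text, chunk_size=20, separators=["\n\n", "\n", ". ", " ", ""])
print(chunks)  # ['alpha beta', 'gamma delta epsilon', 'zeta eta theta']
```

The first paragraph fits as-is; the second is too long, has no line or sentence breaks, and so falls through to word-level splitting.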
Strategy 3: Document-structure chunking (Markdown / HTML / Code)¶
The idea: Use the document's structural markers (headings, tags, function defs) as natural boundaries.
Markdown-aware¶
from langchain_text_splitters import MarkdownHeaderTextSplitter
splitter = MarkdownHeaderTextSplitter(
headers_to_split_on=[
("#", "h1"),
("##", "h2"),
("###", "h3"),
],
)
chunks = splitter.split_text(markdown_text)
# Each chunk's metadata records the header path
Each chunk knows what section it came from. Pair with RecursiveCharacterTextSplitter if sections are still too big.
flowchart LR
A[Markdown text] --> B[MarkdownHeaderTextSplitter]
B --> C[Section chunks with header metadata]
C --> D[Recursive splitter for size]
D --> E[Final chunks with structure preserved]
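The header-path metadata is the part worth understanding. The sketch below shows one way to track it in plain Python; `split_by_headers` is a hypothetical function that handles only `#` to `###` headings and is not how MarkdownHeaderTextSplitter is implemented.

```python
import re


def split_by_headers(markdown_text: str) -> list[dict]:
    """Split Markdown at #, ## and ### headings, recording the header path per section."""
    sections: list[dict] = []
    path: dict[str, str] = {}   # current header path, e.g. {"h1": "Guide", "h2": "Install"}
    buffer: list[str] = []

    def flush() -> None:
        body = "\n".join(buffer).strip()
        if body:
            sections.append({"text": body, "metadata": dict(path)})
        buffer.clear()

    for line in markdown_text.splitlines():
        match = re.match(r"^(#{1,3})\s+(.*)$", line)
        if not match:
            buffer.append(line)
            continue
        flush()  # close the section that ended at this heading
        level = len(match.group(1))
        # A new heading invalidates any recorded heading at the same or a deeper level.
        for key in [k for k in path if int(k[1:]) >= level]:
            del path[key]
        path[f"h{level}"] = match.group(2).strip()
    flush()
    return sections


doc = "# Guide\nIntro text.\n## Install\nRun pip.\n## Usage\nCall it.\n"
for section in split_by_headers(doc):
    print(section["metadata"], "->", section["text"])
```

Each section can then be passed through a size-based splitter while keeping its metadata, which is the layered pattern shown in the diagram.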
Code-aware¶
from langchain_text_splitters import RecursiveCharacterTextSplitter, Language
py_splitter = RecursiveCharacterTextSplitter.from_language(
language=Language.PYTHON, chunk_size=500, chunk_overlap=50,
)
Prefers class and def boundaries as split points. A function longer than chunk_size can still be split, so pick a chunk_size that fits your typical function.
HTML-aware¶
from langchain_text_splitters import HTMLHeaderTextSplitter
splitter = HTMLHeaderTextSplitter(
headers_to_split_on=[("h1", "h1"), ("h2", "h2")],
)
Pros:
- Preserves document structure in metadata.
- Critical for citations ("found in section X → subsection Y").
Cons:
- Only works for structured formats.
Strategy 4: Semantic chunking¶
The idea: Use embeddings to detect where topic shifts happen, split there.
flowchart LR
A[Long text] --> B[Split into sentences]
B --> C[Embed each sentence]
C --> D[Compute similarity between adjacent sentences]
D --> E{Similarity drop > threshold}
E -->|yes| F[Cut here - topic shift]
E -->|no| G[Keep together]
F --> H[Semantic chunks]
G --> H
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
splitter = SemanticChunker(
OpenAIEmbeddings(),
breakpoint_threshold_type="percentile", # or "standard_deviation"
breakpoint_threshold_amount=95,
)
chunks = splitter.create_documents([long_text])
Math behind it:
For consecutive sentences \(s_i, s_{i+1}\), compute their embeddings \(\vec{e_i}, \vec{e_{i+1}}\). Distance: \(d_i = 1 - \cos(\vec{e_i}, \vec{e_{i+1}})\).
Cut where \(d_i\) spikes above the 95th percentile of all \(d_i\) in the document — that's a "topic shift".
Pros:
- Chunks are semantically coherent.
- No size mismatch with topics.
- Reported to improve retrieval recall over recursive splitting by roughly 5-15%, though the gain depends on the corpus.
Cons:
- Expensive: one embedding call per sentence at ingest time.
- Chunk sizes vary wildly: some 100 chars, some 3000.
- Threshold tuning is empirical.
When to use: valuable for narrative documents (articles, books) where topics shift naturally. Less useful for densely structured docs (reference manuals).
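The breakpoint rule from the math above can be sketched without any embedding API. The example assumes sentence embeddings are already computed and non-zero (the 2-D vectors here are made up for illustration); `semantic_groups` and `percentile` are hypothetical helpers, not the SemanticChunker internals.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values: list[float], q: float) -> float:
    """Percentile with linear interpolation between the two closest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def semantic_groups(sentences: list[str], embeddings: list[list[float]], q: float = 95.0):
    """Group sentences, cutting after sentence i when d_i exceeds the q-th percentile."""
    distances = [cosine_distance(embeddings[i], embeddings[i + 1])
                 for i in range(len(embeddings) - 1)]
    if not distances:
        return [list(sentences)]
    threshold = percentile(distances, q)
    groups, current = [], [sentences[0]]
    for i, d in enumerate(distances):
        if d > threshold:        # distance spike: treat as a topic shift
            groups.append(current)
            current = []
        current.append(sentences[i + 1])
    groups.append(current)
    return groups


sentences = ["Cats purr.", "Cats nap.", "Stocks fell.", "Bonds rose."]
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]  # made-up vectors
print(semantic_groups(sentences, toy_embeddings))
# [['Cats purr.', 'Cats nap.'], ['Stocks fell.', 'Bonds rose.']]
```

Only the distance between the second and third sentence exceeds the 95th-percentile threshold, so that is the single cut.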
Strategy 5: Agentic chunking¶
The idea: Use an LLM to read the document and decide where to chunk.
flowchart LR
A[Document] --> B[LLM: identify topic boundaries]
B --> C[LLM-proposed chunks]
C --> D[Validate sizes - re-split if needed]
D --> E[Final chunks with reasoning]
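import json

def agentic_chunk(text: str, llm) -> list[str]:
    # Ask the LLM to propose topic-level chunks as a JSON list of strings.
    prompt = f"""
    Split the following document into chunks where each chunk is
    self-contained and covers ONE topic. Return a JSON list of strings.
    Document:
    {text}
    """
    response = llm.invoke(prompt)
    # Assumes the reply is bare JSON; real replies may need cleanup before parsing.
    return json.loads(response.content)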
Pros:
- LLM understands semantics beyond what embeddings capture.
- Can preserve cross-references ("see Figure 3" stays with Figure 3).
Cons:
- Expensive: one LLM call per document; for 10K docs, costs add up.
- Slow indexing.
- Non-deterministic (different runs → different chunks).
When to use: small, high-value corpora (legal contracts, medical guidelines) where quality justifies cost. Not for bulk web/ticket data.
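The "validate sizes, re-split if needed" step in the diagram deserves its own guard, because LLM output is not guaranteed to be well-formed. The sketch below is one possible shape for it: `parse_and_validate` is a hypothetical helper, it assumes the model was asked for a JSON list of strings, and the hard character cut for oversized chunks is a placeholder for whatever size-based splitter you prefer.

```python
import json


def parse_and_validate(raw_response: str, max_chars: int) -> list[str]:
    """Parse LLM-proposed chunks and re-split any that exceed max_chars."""
    try:
        proposed = json.loads(raw_response)
    except json.JSONDecodeError:
        return []  # caller should fall back to a non-LLM splitter
    if not isinstance(proposed, list):
        return []

    chunks: list[str] = []
    for item in proposed:
        if not isinstance(item, str) or not item.strip():
            continue  # skip empty or non-text entries
        item = item.strip()
        # Oversized chunk: fall back to a hard cut (placeholder strategy).
        chunks.extend(item[i:i + max_chars] for i in range(0, len(item), max_chars))
    return chunks


print(parse_and_validate('["abc", "defghij", ""]', max_chars=4))  # ['abc', 'defg', 'hij']
print(parse_and_validate("not json", max_chars=4))                # []
```

An empty result signals that the document should go through a deterministic splitter instead.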
Strategy 6: Hierarchical / Parent-Document chunking¶
The idea: Index small chunks (precise matching), retrieve their large parent chunks (rich context).
flowchart LR
DOC[Original Document] --> P[Parent chunks 2000 chars]
P --> C[Child chunks 200 chars]
C --> VS[Vector store with small chunks]
Q[Query] --> VS
VS --> MATCH[Match child IDs]
MATCH --> RET[Return PARENT chunks for context]
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)
# vector_store: any LangChain vector store you have already created (not shown here)
retriever = ParentDocumentRetriever(
    vectorstore=vector_store,
    docstore=InMemoryStore(),
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retriever.add_documents(docs)
Pros:
- Best of both worlds: precise matching, rich context.
- The pattern most production systems converge on.
Cons:
- Two data structures (vector store + doc store) to keep in sync.
- More implementation complexity.
When to use: technical docs, FAQs, anywhere small queries should retrieve big context.
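The child-to-parent lookup is the core of the pattern and can be illustrated without a vector store. In this sketch the names `build_children` and `retrieve_parents` are hypothetical, parents are paragraphs, children are sentences, and word overlap stands in for vector similarity.

```python
import re


def words(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))


def build_children(parents: list[str]) -> list[tuple[int, str]]:
    """Split each parent into sentence-sized children that remember their parent's index."""
    return [(pid, sentence)
            for pid, parent in enumerate(parents)
            for sentence in parent.split(". ") if sentence]


def retrieve_parents(query: str, parents: list[str],
                     children: list[tuple[int, str]], k: int = 1) -> list[str]:
    """Match on small children, then return the deduplicated parents they came from."""
    query_words = words(query)
    # Word overlap is a stand-in for vector similarity search.
    ranked = sorted(children, key=lambda c: len(query_words & words(c[1])), reverse=True)
    parent_ids: list[int] = []
    for pid, _ in ranked[:k]:
        if pid not in parent_ids:
            parent_ids.append(pid)
    return [parents[pid] for pid in parent_ids]


parents = [
    "Mitochondria make ATP. They have two membranes.",
    "Ribosomes build proteins. They read mRNA.",
]
children = build_children(parents)
print(retrieve_parents("which organelle reads mRNA", parents, children))
# ['Ribosomes build proteins. They read mRNA.']
```

The query matches only the short sentence about mRNA, yet the caller receives the whole paragraph, including the sentence that names the organelle.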
Strategy 7: Contextual chunking (Anthropic's "Contextual Retrieval")¶
The idea: Before embedding each chunk, prepend an LLM-generated context summary of where it sits in the document.
flowchart LR
A[Original chunk: ATP synthesis happens in...] --> B[LLM: explain context]
B --> C[Context: This chunk is from the Cellular Respiration chapter, section Mitochondria]
C --> D[Augmented chunk: CONTEXT + original text]
D --> E[Embed augmented chunk]
def contextual_chunk(chunk_text: str, document_text: str, llm) -> str:
    # Ask the LLM to situate the chunk within the full document.
    prompt = f"""
    <document>
    {document_text}
    </document>
    Here is a chunk:
    <chunk>{chunk_text}</chunk>
    Provide a short 1-2 sentence context for situating this chunk
    within the document. Output ONLY the context.
    """
    context = llm.invoke(prompt).content
    # Prepend the generated context so it is embedded together with the chunk.
    return f"Context: {context}\n\n{chunk_text}"
From Anthropic's 2024 Contextual Retrieval write-up: prepending context before embedding cut the retrieval failure rate by 35% compared with embedding the same chunks without context.
Pros:
- Each chunk knows its place in the larger document.
- Dramatically improves recall for ambiguous chunks (e.g., a code snippet without context).
Cons:
- Expensive — one LLM call per chunk. Use gpt-4o-mini to keep costs sane.
- Caching helps (same chunk + same doc → same context).
When to use: medium-sized, high-value corpora. Anthropic's research suggests it's worth the cost for production RAG.
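Since the cost is one LLM call per chunk, caching the generated context is the obvious first optimization. The sketch below keys an in-memory cache on content hashes of the chunk and the document; `make_cached_contextualizer` is a hypothetical wrapper, and `fake_llm_context` stands in for a real LLM call.

```python
import hashlib


def make_cached_contextualizer(generate_context):
    """Wrap a context generator so each (chunk, document) pair is generated only once."""
    cache: dict[tuple[str, str], str] = {}

    def digest(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def contextualize(chunk_text: str, document_text: str) -> str:
        key = (digest(chunk_text), digest(document_text))
        if key not in cache:
            cache[key] = generate_context(chunk_text, document_text)
        return f"Context: {cache[key]}\n\n{chunk_text}"

    return contextualize


calls = []

def fake_llm_context(chunk_text: str, document_text: str) -> str:
    # Stand-in for an LLM call; records how often it is invoked.
    calls.append(chunk_text)
    return "From the Mitochondria section."


contextualize = make_cached_contextualizer(fake_llm_context)
first = contextualize("ATP synthesis happens in...", "full document text")
second = contextualize("ATP synthesis happens in...", "full document text")
print(first == second, len(calls))  # True 1
```

In production the dictionary would typically be replaced by a persistent store so the cache survives re-indexing runs.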
Strategy 8: Late chunking (advanced)¶
The idea: Embed the WHOLE document first with a long-context embedder (e.g., jina-embeddings-v3 with 8K context). THEN chunk the resulting per-token embeddings.
flowchart LR
A[Full document up to 8K tokens] --> B[Long-context embedder]
B --> C[Per-token embeddings retain global context]
C --> D[Pool over chunk windows]
D --> E[Chunk embeddings with full doc context]
The key innovation: each chunk's embedding already saw the whole document. So a chunk that mentions "the protocol" knows which protocol from the rest of the doc.
Pros:
- No need for contextual prepending (the embedding already has context).
- One embedding pass per document instead of one per chunk.
Cons:
- Requires a long-context embedder that exposes per-token embeddings (e.g., Jina v3); without one, the technique does not apply.
When to use: cutting-edge production systems where retrieval quality is critical and you can use Jina v3 or similar.
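The pooling step is simple once per-token embeddings are available. The sketch below assumes mean pooling over each chunk's token span; `late_chunk_embeddings` is a hypothetical function, and the tiny 2-D token vectors are made up in place of real model output.

```python
def late_chunk_embeddings(token_embeddings: list[list[float]],
                          spans: list[tuple[int, int]]) -> list[list[float]]:
    """Mean-pool per-token embeddings over each (start, end) token span."""
    pooled = []
    for start, end in spans:
        window = token_embeddings[start:end]
        dim = len(window[0])
        pooled.append([sum(vec[d] for vec in window) / len(window) for d in range(dim)])
    return pooled


# Made-up per-token embeddings for a 4-token "document".
token_embeddings = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
print(late_chunk_embeddings(token_embeddings, [(0, 2), (2, 4)]))
# [[2.0, 0.0], [0.0, 3.0]]
```

In a real system each token vector has already attended to the whole document, which is what gives the pooled chunk vectors their global context.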
4. Visual Learning¶
Architecture — chunking strategies side by side¶
flowchart TD
DOC[Document]
DOC --> S1[Fixed]
DOC --> S2[Recursive - default]
DOC --> S3[Structure-aware Markdown/HTML/code]
DOC --> S4[Semantic uses embeddings]
DOC --> S5[Agentic uses LLM]
DOC --> S6[Hierarchical parent+child]
DOC --> S7[Contextual LLM-augmented]
DOC --> S8[Late chunking long-context embedder]
Workflow — choosing the right strategy¶
flowchart TD
A[New corpus] --> B{Format}
B -->|Markdown| MK[MarkdownHeader + Recursive]
B -->|Code| CD[Language-aware Recursive]
B -->|HTML| HT[HTMLHeader + Recursive]
B -->|General prose| C{Corpus size}
C -->|Tiny < 1000 chunks| AG[Agentic - quality first]
C -->|Medium 1K-100K| SE[Semantic or Contextual]
C -->|Large > 100K| RC[Recursive - safe default]
C -->|Quality critical| PD[Parent Document]
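The decision flow above can be written down as a small routing function. This is only a sketch of the diagram: `choose_strategy` is a hypothetical name, and checking the quality-critical flag before corpus size is an assumption, since the diagram does not order those branches.

```python
def choose_strategy(fmt: str, n_chunks: int, quality_critical: bool = False) -> str:
    """Pick a chunking strategy from document format, corpus size, and quality needs."""
    by_format = {
        "markdown": "MarkdownHeader + Recursive",
        "code": "Language-aware Recursive",
        "html": "HTMLHeader + Recursive",
    }
    if fmt in by_format:
        return by_format[fmt]
    # General prose: decide by quality requirement, then by corpus size.
    if quality_critical:
        return "Parent Document"
    if n_chunks < 1_000:
        return "Agentic"
    if n_chunks <= 100_000:
        return "Semantic or Contextual"
    return "Recursive"


print(choose_strategy("markdown", 50_000))  # MarkdownHeader + Recursive
print(choose_strategy("prose", 500))        # Agentic
print(choose_strategy("prose", 2_000_000))  # Recursive
```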
Sequence — chunking + indexing pipeline¶
sequenceDiagram
participant App
participant Loader
participant Splitter
participant Embedder
participant VS as Vector Store
App->>Loader: load docs
Loader-->>App: List[Document]
App->>Splitter: split_documents
Splitter-->>App: List[Chunk]
loop Per chunk
App->>Embedder: embed
Embedder-->>App: vector
App->>VS: insert(vector, chunk, metadata)
end
Real-world example — legal contracts¶
flowchart LR
A[Contract PDF] --> B[UnstructuredPDFLoader]
B --> C[Markdown-style structure detected]
C --> D[MarkdownHeaderSplitter by clause]
D --> E[Contextual chunking with LLM]
E --> F[Each chunk knows: client, contract type, clause number, section]
F --> G[Vector store with rich metadata]
Now queries like "show me indemnity clauses from NDAs signed in 2024" work via metadata + semantic retrieval.
5. Pros (chunking in general)¶
- Granular retrieval — pinpoint paragraph-level results.
- Fits embedder budgets — every chunk under the limit.
- Cost-efficient indexing — embedding is cheap per chunk.
- Citation-friendly — chunks carry source + position metadata.
- Strategy choice per corpus — pick the right knife for the job.
6. Cons¶
- Boundary loss — even smart splitters sometimes cut in bad spots.
- Sweet-spot tuning — chunk_size is empirical, not theoretical.
- Metadata bloat — for large corpora, metadata can outweigh content storage.
- Strategy mismatch — the wrong choice can tank recall (in the startup example earlier, recall@5 was 0.3 with whole-document vectors versus 0.91 after chunking).
7. Trade-offs (per strategy)¶
| Strategy | Recall | Indexing Cost | Implementation Complexity | When |
|---|---|---|---|---|
| Fixed | ⛔ Low | ✅ Free | ✅ Trivial | Synthetic data only |
| Recursive | ✅ Good | ✅ Free | ✅ Easy | General default |
| Structure-aware | ✅ Great (if structured) | ✅ Free | ✅ Easy | Markdown/HTML/code |
| Semantic | ✅ Better | ⚠️ Medium ($/doc) | ⚠️ Medium | Narrative content |
| Agentic | ✅✅ Best | ⛔ High ($/doc) | ⚠️ Medium | Small high-value corpora |
| Hierarchical | ✅✅ Best | ✅ Free | ⚠️ Medium | Technical docs, FAQs |
| Contextual | ✅✅ Best (35% fewer retrieval failures) | ⛔ High ($/chunk) | ⚠️ Medium | Production-grade RAG |
| Late chunking | ✅✅ Best | ⚠️ Medium | ⚠️ Hard | Cutting-edge systems |
8. Real-world Industry Usage¶
OpenAI (Assistants API)¶
- Default: Recursive chunking, ~800 tokens, 400 overlap.
- Hidden behind their managed file search; users don't tune it.
Anthropic¶
- Published their Contextual Retrieval technique in 2024 (35% fewer retrieval failures).
- Claude Projects use a variant with prepended summaries.
Google (Vertex AI Search)¶
- Multi-strategy: structure-aware for HTML, recursive for plain text.
- Supports per-corpus chunk size tuning.
Enterprise¶
- Bloomberg: structure-aware splitting of 10-K filings by section.
- JPMorgan: agentic chunking for high-value legal documents.
- GitHub Copilot: language-aware code splitting by function/class.
- Notion AI: section-based splitting respecting Notion's block model.
- Stripe Docs AI: Markdown header-aware for documentation.
Production patterns¶
- Multi-stage pipelines: structure splitter → semantic re-split → contextual augmentation.
- Caching contextual outputs — same chunk in same doc → reuse generated context.
- A/B testing strategies — ship two indexes, compare recall metrics, pick winner.
9. Interview Questions¶
Beginner¶
- Why do we chunk? — Embedder limits + retrieval granularity.
- Why overlap? — Preserve concepts that straddle a boundary.
- What's a good default chunk_size? — 500-1000 chars, 100-200 overlap.
Intermediate¶
- Recursive vs Fixed — why is Recursive better? — Respects natural boundaries (paragraph, sentence) before falling back.
- When does Semantic chunking beat Recursive? — Narrative content with topic shifts; expensive but better recall.
- Difference between Parent-Document and Contextual chunking? — Parent-Document indexes small child chunks but returns their larger parents (same text, wider window). Contextual embeds augmented chunks (text + LLM-generated context).
- How do you decide chunk_size? — Evaluate on a held-out test set; sweep sizes 200, 500, 1000, 2000; pick best recall@k.
Advanced¶
- Anthropic's Contextual Retrieval — how does it work and why does it cut retrieval failures by 35%? — LLM-generated context is prepended to each chunk before embedding; chunks are no longer "anonymous", because the embedder sees them as situated in the full doc.
- Late chunking — what's novel? — Embed full document first with long-context embedder; pool per-token embeddings into chunk vectors → chunks inherit global context.
- Agentic chunking vs Semantic — when use which? — Agentic for small corpora (LLM cost amortizes), Semantic for medium (no per-chunk LLM cost), Recursive for huge.
System design¶
- Design a chunking pipeline for 100M heterogeneous documents (PDFs, HTML, code). — Route by detected MIME type to specialized splitter; fixed defaults per type; fallback to recursive; log shrinkage metric (chunks/doc) per source.
- Production team reports retrieval recall has dropped from 0.85 to 0.65 over six months. Diagnose. — Likely: corpus drift (new format ingested without proper splitter), index staleness, or query distribution shifted. Audit chunk size distribution, run RAGAS on a baseline test set.
10. Common Mistakes¶
Beginners¶
- ❌ Using CharacterTextSplitter and being surprised chunks exceed chunk_size.
- ❌ Zero overlap → critical boundary information lost.
- ❌ Splitting code with text splitter — function bodies sliced in half.
- ❌ Same chunk_size for every corpus — different content needs different sizes.
Production teams¶
- ❌ Not running structure splitter FIRST for Markdown/HTML — losing header metadata.
- ❌ Using semantic/agentic chunking at indexing time without caching.
- ❌ No chunk-size evaluation. Picking 1000 by tradition without measurement.
- ❌ Losing metadata across split — child chunks don't inherit parent ids.
How to avoid¶
- Always inherit metadata: splitter.split_documents([doc]) does this automatically.
- A/B test chunk sizes early (cheap experiment, high impact).
- Cache contextual outputs by (chunk_hash, doc_hash).
- Log chunk size distribution per corpus; outliers indicate splitter failure.
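A chunk-size report takes only a few lines and catches most splitter failures early. The sketch below is one possible shape for it; `chunk_size_report` is a hypothetical helper, and treating anything under 10% of chunk_size as "tiny" is an arbitrary assumption to tune per corpus.

```python
import statistics


def chunk_size_report(chunks: list[str], chunk_size: int) -> dict:
    """Summarize chunk lengths and count outliers relative to the configured chunk_size."""
    lengths = [len(chunk) for chunk in chunks]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
        "oversized": sum(1 for n in lengths if n > chunk_size),   # splitter did not enforce the limit
        "tiny": sum(1 for n in lengths if n < chunk_size // 10),  # fragments unlikely to carry meaning
    }


chunks = ["a" * 500, "b" * 480, "c" * 20, "d" * 1200]
print(chunk_size_report(chunks, chunk_size=500))
# {'count': 4, 'min': 20, 'median': 490.0, 'max': 1200, 'oversized': 1, 'tiny': 1}
```

A non-zero "oversized" count is the symptom described under the beginner mistakes: a splitter that found no separator and returned a chunk larger than chunk_size.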
11. Best Practices¶
Industry standards¶
- Recursive as default. Use language/format-aware splitters when applicable.
- 20% overlap is the safe starting point.
- Layered splitting: structure → size for Markdown/HTML.
- Token-counting via TokenTextSplitter for tight embedder budgets.
Production¶
- Evaluate chunk_size choices with RAGAS on a held-out test set.
- Use Parent-Document Retriever for technical docs.
- Add Contextual Retrieval if recall is critical and budget allows.
- Cache contextual augmentations.
Optimization¶
- For huge corpora: bulk-process with parallelism (ThreadPoolExecutor over loader output).
- For semantic chunking: lazy-evaluate; only run it on documents where recall is critical.
- Maintain a golden test set of (query, expected chunk) pairs; re-run after every splitter change.
12. Evolution Story¶
flowchart LR
A[Whole-doc embedding] --> B[Fixed-size chunking]
B --> C[Recursive chunking]
C --> D[Structure-aware Markdown/HTML/code]
D --> E[Semantic chunking embeddings]
E --> F[Agentic chunking LLM]
F --> G[Hierarchical parent-child]
G --> H[Contextual chunking 35% fewer retrieval failures]
H --> I[Late chunking long-context embedder]
Where we are: Chunking is a rich design space. Pick by corpus type, budget, and quality requirements.
Where we're going (next chapter): Now each chunk needs to become a vector. We need an Embedding model that captures meaning into a fixed-length vector — and we need to understand the math of similarity (cosine, dot product, Euclidean, Manhattan, Hamming, Jaccard) to know which embeddings are "close" to a query.
Practice¶
Use RecursiveCharacterTextSplitter (not CharacterTextSplitter) to enforce chunk_size
Quiz — Quick check¶
What you remember
Q1. Why use chunk_overlap?
- Ensures concepts spanning a boundary appear in both adjacent chunks
- Saves memory
- Speeds up embedding
- Reduces hallucination directly
Q2. Which strategy uses an LLM to GENERATE chunk context (per Anthropic's research)?
- Recursive
- Semantic
- Contextual chunking
- Fixed
Q3. When is Parent-Document Retriever the right pick?
- When small chunks match precisely but you need surrounding context
- When you have only one document
- When you need streaming
- Always — it's the default
Common doubts¶
Which strategy should I start with?
Recursive (the default). It works for 90% of content. Measure recall@k. Only switch to semantic/contextual/agentic when the metric demands it.
How do I evaluate chunking quality?
Build a small test set of (query, expected_doc_id) pairs. Run retrieval with each candidate chunking strategy. Measure recall@5. Cheapest, most impactful experiment in your RAG pipeline.
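The metric itself is a few lines. In this sketch `recall_at_k` is a hypothetical helper: `results` maps each query to the ranked document ids your retriever returned, and `expected` maps each query to the one id that should appear.

```python
def recall_at_k(results: dict[str, list[str]], expected: dict[str, str], k: int = 5) -> float:
    """Fraction of queries whose expected doc id appears in the top-k retrieved ids."""
    if not expected:
        raise ValueError("expected must contain at least one query")
    hits = sum(1 for query, doc_id in expected.items()
               if doc_id in results.get(query, [])[:k])
    return hits / len(expected)


expected = {"q1": "doc-b", "q2": "doc-x"}
results = {
    "q1": ["doc-a", "doc-b", "doc-c"],  # expected id retrieved at rank 2
    "q2": ["doc-c", "doc-d"],           # expected id missing
}
print(recall_at_k(results, expected, k=5))  # 0.5
print(recall_at_k(results, expected, k=1))  # 0.0
```

Run it once per candidate chunking configuration on the same test set and compare the numbers.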
Will Contextual Retrieval replace everything?
Maybe not — it's expensive. But it's the strongest known recall booster for non-trivial corpora. Use it where stakes justify the cost.