Text Splitters & Chunking Strategies¶

1. Why does this topic exist?¶

We have clean Documents from the loader. Why not just embed them whole?

Three hard limits forbid it:

Embedding models cap input length: roughly 8K tokens for OpenAI's text-embedding-3-small, and 32K for the largest models. A 200-page PDF is about 80K tokens, so it simply won't fit in one call.
One vector can't represent 50 pages. An embedding compresses meaning into a fixed-length vector of roughly 1,500 numbers. Cram 50 pages of meaning into it and every concept blurs into an average. Retrieval degrades to "give me the doc most generally similar", which is useless.
Retrieval granularity drives answer quality. When a user asks about a specific topic, you want the specific paragraph, not the whole book. Without chunking, you can only retrieve "books that mention X", not "the paragraph about X".

Industry pain example: A startup embedded entire research papers as single vectors. Every query returned the same 3 papers (the broadest ones, which were vaguely relevant to everything). After chunking into 800-char pieces, the same queries returned specific equations, specific figures, and specific paragraphs. Recall@5 went from 0.3 to 0.91.

Why the old approach failed:

Old approach	Problem
One vector per document	Loses granularity
Naive split every 1000 chars	Cuts mid-sentence, mid-equation
Manual section detection	Bespoke per format, brittle

Chunking solves the granularity problem; smart chunking solves the boundary problem (don't cut in the middle of meaning).

2. What is it?¶

Simple explanation¶

A text splitter is a knife that knows where to cut. It breaks long documents into smaller, overlapping pieces — each small enough to embed, large enough to carry meaning.

Technical explanation¶

A Text Splitter is a class that takes a Document (or list) and returns a list of smaller Documents ("chunks"), governed by two parameters:

chunk_size: max characters/tokens per chunk
chunk_overlap: characters shared between consecutive chunks

flowchart LR
    A[Document<br/>10000 tokens] --> B[Splitter<br/>size=500 overlap=100]
    B --> C[Chunk 1: tokens 0-500]
    B --> D[Chunk 2: tokens 400-900]
    B --> E[Chunk 3: tokens 800-1300]
    B --> F[...]

The chunks preserve the parent document's metadata and add their own (chunk index, etc.).
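To make the window arithmetic in the diagram concrete, here is a minimal sketch in plain Python. The helper name `chunk_spans` is hypothetical (it is not part of any library); it only computes the start and end offsets that a fixed window with overlap would produce.

```python
def chunk_spans(total_len: int, chunk_size: int, chunk_overlap: int) -> list[tuple[int, int]]:
    """Return (start, end) offsets for fixed windows that share `chunk_overlap` units."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    stride = chunk_size - chunk_overlap  # how far each window advances
    spans = []
    start = 0
    while start < total_len:
        end = min(start + chunk_size, total_len)
        spans.append((start, end))
        if end == total_len:  # the last window reached the end of the text
            break
        start += stride
    return spans


spans = chunk_spans(total_len=10_000, chunk_size=500, chunk_overlap=100)
print(spans[:3])  # [(0, 500), (400, 900), (800, 1300)]
print(len(spans))  # 25
```

Each window advances by `chunk_size - chunk_overlap`, which is why the second chunk starts at 400 and not at 500.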

Industry definition¶

The term "chunking" pre-dates RAG: it comes from information retrieval work going back to the 1970s. In the RAG era (2023+), eight strategies have emerged, each solving a different failure mode.

Mental model¶

Imagine slicing a long article into note cards. You want each card:

Small enough to study quickly.
Large enough to be self-contained (a single sentence isn't a card).
Overlapping so an idea straddling two cards lives on both.
Boundary-aware — never cut in the middle of a sentence or table.

That's the chunking job.

Analogy¶

A skilled chef cutting vegetables: uniform sizes (for even cooking), at natural joints (not through the center of a carrot), with enough overlap (small extras) to ensure nothing's wasted.

3. How does it work?¶

The two knobs every splitter has¶

Knob	What it controls	Typical
`chunk_size`	Max chars/tokens per chunk	500-1500
`chunk_overlap`	Shared chars between adjacent chunks	10-20% of chunk_size

Why overlap matters¶

flowchart LR
    A[Chunk A: ...the mitochondria is the powerhouse of] --> B[Chunk B: powerhouse of the cell. ATP synthesis...]

Without overlap, "powerhouse of the cell" gets split at "of" → neither chunk has the full phrase → retrieval misses queries about "powerhouse of the cell."
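The effect can be reproduced with a few lines of plain Python. This is a sketch with a hypothetical `windows` helper and character-based windows; the sizes (37 and 15) are chosen only so that the cut lands right after "of", as in the diagram.

```python
def windows(text: str, size: int, overlap: int) -> list[str]:
    """Cut `text` into fixed-size character windows that share `overlap` characters."""
    stride = size - overlap
    return [text[i:i + size] for i in range(0, len(text), stride)]


text = "the mitochondria is the powerhouse of the cell. ATP synthesis follows."
phrase = "powerhouse of the cell"

no_overlap = windows(text, size=37, overlap=0)
with_overlap = windows(text, size=37, overlap=15)

print(no_overlap[0])                             # ends with "powerhouse of"
print(any(phrase in c for c in no_overlap))      # False
print(any(phrase in c for c in with_overlap))    # True
```

With zero overlap the phrase is cut in two; with a 15-character overlap the second window contains it whole.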

Token-vs-character¶

Most splitters count characters by default. For exact embedder-budget control, use TokenTextSplitter:

from langchain_text_splitters import TokenTextSplitter
splitter = TokenTextSplitter(chunk_size=300, chunk_overlap=50)

Rule of thumb: ~4 chars ≈ 1 token (English).
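As a rough illustration of that rule of thumb, the sketch below converts between characters and tokens. The helper names are hypothetical, and the factor of 4 is only the English-text heuristic stated above; a real tokenizer will give different counts, especially for code or non-English text.

```python
import math

CHARS_PER_TOKEN = 4  # heuristic for English prose, not an exact conversion


def estimate_tokens(text: str) -> int:
    """Estimate the token count of `text` from its character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def max_chars_for_budget(token_budget: int) -> int:
    """Approximate number of characters that fit into a token budget."""
    return token_budget * CHARS_PER_TOKEN


print(estimate_tokens("a" * 1000))   # 250
print(max_chars_for_budget(300))     # 1200
```

So a 300-token chunk corresponds to roughly 1,200 characters of English text.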

The 8 chunking strategies¶

We'll cover eight distinct strategies, from simplest to most sophisticated. Each addresses a specific failure of the ones before it.

Strategy 1: Fixed-size chunking¶

The idea: Split every N characters. Period.

from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(
    separator="",        # no special separator
    chunk_size=500,
    chunk_overlap=100,
)
chunks = splitter.split_documents(docs)

Diagram:

flowchart LR
    A[Text: The quick brown fox jumps over the lazy dog. Mitochondria is...] --> B[Chunk 0-500]
    A --> C[Chunk 400-900]
    A --> D[Chunk 800-1300]

Pros:
- Trivial implementation.
- Predictable size = predictable cost.

Cons:
- Cuts mid-sentence, mid-word, mid-equation.
- Destroys semantic boundaries.

When to use: never, except for synthetic/uniform data (logs, structured records).

Failure case: If a chunk boundary falls inside "the temperature was 98.6°F", the text splits into "the temperature was 98." and "6°F". Neither fragment carries the measurement, so the embedder sees noise.
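The cut is easy to reproduce without any library. The function below is a hypothetical stand-in for a fixed-size splitter, and the chunk size of 23 is picked only so that the boundary lands inside the number.

```python
def fixed_size_chunks(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Split `text` every `chunk_size` characters, ignoring all boundaries."""
    stride = chunk_size - chunk_overlap
    if stride <= 0:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    return [text[i:i + chunk_size] for i in range(0, len(text), stride)]


pieces = fixed_size_chunks("the temperature was 98.6°F", chunk_size=23)
print(pieces)  # ['the temperature was 98.', '6°F']
```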

Strategy 2: Recursive chunking — the default¶

The idea: Try to split on natural separators in priority order — paragraph → line → sentence → word → character. Falls back to the next as needed.

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_documents(docs)

Diagram:

flowchart TD
    A[Long text] --> B{Fits in chunk_size}
    B -->|yes| KEEP[Keep as one chunk]
    B -->|no| C[Split on paragraph \\n\\n]
    C --> D{All pieces fit}
    D -->|yes| OUT1[Done]
    D -->|no| E[Split bigger pieces on line \\n]
    E --> F{All fit}
    F -->|yes| OUT2[Done]
    F -->|no| G[Split on sentence . ]
    G --> H[... recurse to word, then char]

Pros:
- Respects natural boundaries when possible.
- Falls back gracefully.
- The right default for general prose.

Cons:
- Doesn't understand semantics (just looks at characters).
- May still split related ideas.

When to use: 90% of use cases. Articles, manuals, docs, books.

Production tip: Always use this for first-pass general text. Customize separators only for special formats.
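To show the fallback logic from the flowchart, here is a simplified pure-Python sketch. It is not LangChain's implementation: it has no overlap, it drops the separator at each cut, and the function name `recursive_split` is hypothetical. It only illustrates the "try the coarsest separator first, recurse on pieces that are still too big" idea.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split on the first separator; recurse with finer separators for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:  # separator not present, try the next one
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = "Para one is short.\n\nPara two is a bit longer than the first one. It has two sentences."
chunks = recursive_split(text, chunk_size=50, separators=["\n\n", "\n", ". ", " ", ""])
for c in chunks:
    print(len(c), repr(c))
```

The first paragraph fits and stays whole; the second is too long for one chunk, so it falls through to the sentence separator.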

Strategy 3: Document-structure chunking (Markdown / HTML / Code)¶

The idea: Use the document's structural markers (headings, tags, function defs) as natural boundaries.

Markdown-aware¶

from langchain_text_splitters import MarkdownHeaderTextSplitter

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "h1"),
        ("##", "h2"),
        ("###", "h3"),
    ],
)
chunks = splitter.split_text(markdown_text)
# Each chunk's metadata records the header path

Each chunk knows what section it came from. Pair with RecursiveCharacterTextSplitter if sections are still too big.

flowchart LR
    A[Markdown text] --> B[MarkdownHeaderTextSplitter]
    B --> C[Section chunks with header metadata]
    C --> D[Recursive splitter for size]
    D --> E[Final chunks with structure preserved]
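For intuition about what "header metadata" means, the following is a minimal sketch of header-based splitting in plain Python. It is an illustration, not the library's code: the function name is hypothetical, only `#` to `###` headers are recognised, and `#` lines inside fenced code blocks are not handled.

```python
import re

HEADER = re.compile(r"^(#{1,3})\s+(.*)$")


def split_by_headers(markdown: str) -> list[dict]:
    """Return sections as {'text', 'metadata'} where metadata holds the header path."""
    path: dict[int, str] = {}   # header level -> current title
    body: list[str] = []        # lines of the section being collected
    sections: list[dict] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            metadata = {f"h{lvl}": title for lvl, title in sorted(path.items())}
            sections.append({"text": text, "metadata": metadata})
        body.clear()

    for line in markdown.splitlines():
        match = HEADER.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header closes any header at the same or a deeper level.
            for deeper in [lvl for lvl in path if lvl >= level]:
                del path[deeper]
            path[level] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return sections


md = "# Guide\nIntro text.\n## Install\nRun the installer.\n## Usage\nCall the API."
sections = split_by_headers(md)
for section in sections:
    print(section["metadata"], "->", section["text"])
```

Each section carries its full header path, which is what makes "found in section X, subsection Y" citations possible.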

Code-aware¶

from langchain_text_splitters import RecursiveCharacterTextSplitter, Language

py_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=500, chunk_overlap=50,
)

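Uses language-specific separators (for Python: class and def boundaries first, then blank lines, lines, and words), so it prefers to cut between definitions. A function longer than chunk_size can still be split.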
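If whole definitions must never be cut, one alternative is to split on the syntax tree instead of on separators. The sketch below uses Python's standard `ast` module; it is a different technique from the separator-based splitter above, the function name is hypothetical, and it deliberately ignores module-level statements and chunk-size limits.

```python
import ast


def split_python_top_level(source: str) -> list[str]:
    """Return the source of each top-level function or class as one chunk."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start = node.lineno
            if node.decorator_list:  # keep decorators attached to their definition
                start = min(d.lineno for d in node.decorator_list)
            chunks.append("\n".join(lines[start - 1:node.end_lineno]))
    return chunks


source = '''import os

def load(path):
    return os.path.basename(path)

class Parser:
    def parse(self, text):
        return text.split()
'''

code_chunks = split_python_top_level(source)
print(len(code_chunks))  # 2
print(code_chunks[0])
```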

HTML-aware¶

from langchain_text_splitters import HTMLHeaderTextSplitter

splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=[("h1", "h1"), ("h2", "h2")],
)

Pros:
- Preserves document structure in metadata.
- Critical for citations ("found in section X → subsection Y").

Cons:
- Only works for structured formats.

Strategy 4: Semantic chunking¶

The idea: Use embeddings to detect where topic shifts happen, split there.

flowchart LR
    A[Long text] --> B[Split into sentences]
    B --> C[Embed each sentence]
    C --> D[Compute similarity between adjacent sentences]
    D --> E{Similarity drop > threshold}
    E -->|yes| F[Cut here - topic shift]
    E -->|no| G[Keep together]
    F --> H[Semantic chunks]
    G --> H

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",  # or "standard_deviation"
    breakpoint_threshold_amount=95,
)
chunks = splitter.create_documents([long_text])

Math behind it:

For consecutive sentences $s_i, s_{i+1}$, compute their embeddings $\vec{e_i}, \vec{e_{i+1}}$. Distance: $d_i = 1 - \cos(\vec{e_i}, \vec{e_{i+1}})$.

Cut where $d_i$ spikes above the 95th percentile of all $d_i$ in the document; that spike marks a "topic shift".
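The rule above can be sketched in plain Python. Everything here is illustrative: the 2-D vectors are hand-written stand-ins for real sentence embeddings, the function names are hypothetical, and the percentile uses simple linear interpolation. A production splitter would call an embedding model and may group or smooth sentences differently.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values: list[float], q: float) -> float:
    """Linearly interpolated percentile, q in [0, 100]."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def semantic_groups(sentences: list[str], embeddings: list[list[float]], q: float = 95) -> list[list[str]]:
    """Start a new group wherever the adjacent-sentence distance exceeds the q-th percentile."""
    if len(sentences) < 2:
        return [list(sentences)]
    distances = [cosine_distance(embeddings[i], embeddings[i + 1])
                 for i in range(len(embeddings) - 1)]
    threshold = percentile(distances, q)
    groups, current = [], [sentences[0]]
    for sentence, d in zip(sentences[1:], distances):
        if d > threshold:  # topic shift between the previous sentence and this one
            groups.append(current)
            current = []
        current.append(sentence)
    groups.append(current)
    return groups


sentences = ["Cells need energy.", "Mitochondria make ATP.", "ATP powers the cell.",
             "Rome was a republic.", "Later it became an empire."]
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0], [0.1, 0.9]]

groups = semantic_groups(sentences, toy_embeddings)
print(groups)
```

With these toy vectors the only distance above the threshold is the one between the third and fourth sentences, so the text splits into a biology group and a history group.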

Pros:
- Chunks are semantically coherent.
- No size mismatch with topics.
- Can improve retrieval recall over recursive splitting. Treat the 5-15% range as a rough guide, not a guarantee, and measure on your own corpus.

Cons:
- Expensive: every sentence must be embedded at ingest time.
- Chunk sizes vary wildly, from around 100 chars to 3000.
- Threshold tuning is empirical.

When to use: valuable for narrative documents (articles, books) where topics shift naturally. Less useful for densely structured docs (reference manuals).

Strategy 5: Agentic chunking¶

The idea: Use an LLM to read the document and decide where to chunk.

flowchart LR
    A[Document] --> B[LLM: identify topic boundaries]
    B --> C[LLM-proposed chunks]
    C --> D[Validate sizes - re-split if needed]
    D --> E[Final chunks with reasoning]

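import json

def agentic_chunk(text: str, llm) -> list[str]:
    """Ask an LLM to split `text` into single-topic chunks.

    Assumes `llm` is a LangChain-style chat model whose invoke() result has `.content`.
    """
    prompt = f"""
Split the following document into chunks where each chunk is
self-contained and covers ONE topic. Return ONLY a JSON list of strings.

Document:
{text}
"""
    response = llm.invoke(prompt)
    # Assumes bare JSON output; real responses may need fence stripping or a retry.
    chunks = json.loads(response.content)
    if not isinstance(chunks, list) or not all(isinstance(c, str) for c in chunks):
        raise ValueError("Expected a JSON list of strings from the LLM")
    return chunks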

Pros:
- LLM understands semantics beyond what embeddings capture.
- Can preserve cross-references ("see Figure 3" stays with Figure 3).

Cons:
- Expensive: one LLM call per document; for 10K docs, costs add up.
- Slow indexing.
- Non-deterministic (different runs → different chunks).

When to use: small, high-value corpora (legal contracts, medical guidelines) where quality justifies cost. Not for bulk web/ticket data.

Strategy 6: Hierarchical / Parent-Document chunking¶

The idea: Index small chunks (precise matching), retrieve their large parent chunks (rich context).

flowchart LR
    DOC[Original Document] --> P[Parent chunks 2000 chars]
    P --> C[Child chunks 200 chars]
    C --> VS[Vector store with small chunks]
    Q[Query] --> VS
    VS --> MATCH[Match child IDs]
    MATCH --> RET[Return PARENT chunks for context]

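# Import paths as in LangChain 0.x; they may differ in other releases.
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Assumes `vector_store` (a LangChain vector store) and `docs` (loaded Documents) already exist.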

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter  = RecursiveCharacterTextSplitter(chunk_size=200)

retriever = ParentDocumentRetriever(
    vectorstore=vector_store,
    docstore=InMemoryStore(),
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retriever.add_documents(docs)

Pros:
- Best of both worlds: precise matching, rich context.
- The pattern most production systems converge on.

Cons:
- Two data structures (vector store + doc store) to keep in sync.
- More implementation complexity.

When to use: technical docs, FAQs, anywhere small queries should retrieve big context.
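The child-to-parent lookup at the heart of this pattern can be shown without a vector store. In the sketch below, paragraphs act as parents, sentences as children, and word overlap stands in for vector similarity; all names are hypothetical and the matcher is deliberately naive.

```python
def build_index(document: str) -> tuple[dict[int, str], list[tuple[str, int]]]:
    """Parents are paragraphs; children are sentences tagged with their parent id."""
    parents = dict(enumerate(document.split("\n\n")))
    children = [(sentence, pid)
                for pid, paragraph in parents.items()
                for sentence in paragraph.split(". ")]
    return parents, children


def word_overlap(query: str, text: str) -> int:
    # Naive stand-in for embedding similarity: count shared lowercase words.
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_parent(query: str, parents: dict[int, str], children: list[tuple[str, int]]) -> str:
    """Match against the small child chunks, but return the larger parent chunk."""
    _, parent_id = max(children, key=lambda child: word_overlap(query, child[0]))
    return parents[parent_id]


document = ("Install the package with pip. Set the API key in settings.\n\n"
            "Rate limits reset every minute. Contact support for higher limits.")
parents, children = build_index(document)
context = retrieve_parent("API key", parents, children)
print(context)  # Install the package with pip. Set the API key in settings.
```

The query matches only the short "API key" sentence, yet the caller receives the whole paragraph, including the installation step that gives it context.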

Strategy 7: Contextual chunking (Anthropic's "Contextual Retrieval")¶

The idea: Before embedding each chunk, prepend an LLM-generated context summary of where it sits in the document.

flowchart LR
    A[Original chunk: ATP synthesis happens in...] --> B[LLM: explain context]
    B --> C[Context: This chunk is from the Cellular Respiration chapter, section Mitochondria]
    C --> D[Augmented chunk: CONTEXT + original text]
    D --> E[Embed augmented chunk]

def contextual_chunk(chunk_text: str, document_text: str, llm):
    prompt = f"""
<document>
{document_text}
</document>
Here is a chunk:
<chunk>{chunk_text}</chunk>
Provide a short 1-2 sentence context for situating this chunk
within the document. Output ONLY the context.
"""
    context = llm.invoke(prompt).content
    return f"Context: {context}\n\n{chunk_text}"

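Anthropic's 2024 Contextual Retrieval write-up reports a 35% reduction in top-20 retrieval failure rate (5.7% → 3.7%) from contextual embeddings alone, compared with embedding the raw chunks.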

Pros:
- Each chunk knows its place in the larger document.
- Dramatically improves recall for ambiguous chunks (e.g., a code snippet without context).

Cons:
- Expensive: one LLM call per chunk. Use a small model such as gpt-4o-mini to keep costs sane.
- Needs caching to stay affordable (same chunk + same doc → same context).

When to use: medium-sized, high-value corpora. Anthropic's research suggests it's worth the cost for production RAG.

Strategy 8: Late chunking (advanced)¶

The idea: Embed the WHOLE document first with a long-context embedder (e.g., jina-embeddings-v3 with 8K context). THEN chunk the resulting per-token embeddings.

flowchart LR
    A[Full document up to 8K tokens] --> B[Long-context embedder]
    B --> C[Per-token embeddings retain global context]
    C --> D[Pool over chunk windows]
    D --> E[Chunk embeddings with full doc context]

The key innovation: each chunk's embedding already saw the whole document. So a chunk that mentions "the protocol" knows which protocol from the rest of the doc.
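The pooling step can be sketched as follows. The per-token vectors are hand-written placeholders for the output of a single long-context forward pass, the function names are hypothetical, and mean pooling is assumed as the pooling operation.

```python
def mean_pool(vectors: list[list[float]]) -> list[float]:
    """Average a list of equal-length vectors component by component."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def late_chunk(token_embeddings: list[list[float]], spans: list[tuple[int, int]]) -> list[list[float]]:
    """Pool already-contextualised token embeddings over each (start, end) token span."""
    return [mean_pool(token_embeddings[start:end]) for start, end in spans]


# Placeholder output of one forward pass over the full document (4 tokens, 2 dims).
token_embeddings = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
chunk_vectors = late_chunk(token_embeddings, spans=[(0, 2), (2, 4)])
print(chunk_vectors)  # [[2.0, 0.0], [0.0, 3.0]]
```

The difference from ordinary chunking is only the order of operations: the model sees the whole document before any boundary is applied, so each token vector already reflects its surroundings when it is pooled.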

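Pros:
- No need for contextual prepending (the embedding already has context).
- One embedding pass per document instead of one per chunk.

Cons:
- Requires a long-context embedder that exposes per-token embeddings (e.g., Jina v3).
- Not the same as late interaction: ColBERT-v2 also keeps per-token vectors, but it scores them at query time instead of pooling them into chunk vectors.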

When to use: cutting-edge production systems where retrieval quality is critical and you can use Jina v3 or similar.

4. Visual Learning¶

Architecture — chunking strategies side by side¶

flowchart TD
    DOC[Document]
    DOC --> S1[Fixed]
    DOC --> S2[Recursive - default]
    DOC --> S3[Structure-aware Markdown/HTML/code]
    DOC --> S4[Semantic uses embeddings]
    DOC --> S5[Agentic uses LLM]
    DOC --> S6[Hierarchical parent+child]
    DOC --> S7[Contextual LLM-augmented]
    DOC --> S8[Late chunking long-context embedder]

Workflow — choosing the right strategy¶

flowchart TD
    A[New corpus] --> B{Format}
    B -->|Markdown| MK[MarkdownHeader + Recursive]
    B -->|Code| CD[Language-aware Recursive]
    B -->|HTML| HT[HTMLHeader + Recursive]
    B -->|General prose| C{Corpus size}
    C -->|Tiny < 1000 chunks| AG[Agentic - quality first]
    C -->|Medium 1K-100K| SE[Semantic or Contextual]
    C -->|Large > 100K| RC[Recursive - safe default]
    C -->|Quality critical| PD[Parent Document]

Sequence — chunking + indexing pipeline¶

sequenceDiagram
    participant App
    participant Loader
    participant Splitter
    participant Embedder
    participant VS as Vector Store
    App->>Loader: load docs
    Loader-->>App: List[Document]
    App->>Splitter: split_documents
    Splitter-->>App: List[Chunk]
    loop Per chunk
        App->>Embedder: embed
        Embedder-->>App: vector
        App->>VS: insert(vector, chunk, metadata)
    end

Real-world example — legal contracts¶

flowchart LR
    A[Contract PDF] --> B[UnstructuredPDFLoader]
    B --> C[Markdown-style structure detected]
    C --> D[MarkdownHeaderSplitter by clause]
    D --> E[Contextual chunking with LLM]
    E --> F[Each chunk knows: client, contract type, clause number, section]
    F --> G[Vector store with rich metadata]

Now queries like "show me indemnity clauses from NDAs signed in 2024" work via metadata + semantic retrieval.

5. Pros (chunking in general)¶

Granular retrieval — pinpoint paragraph-level results.
Fits embedder budgets — every chunk under the limit.
Cost-efficient indexing — embedding is cheap per chunk.
Citation-friendly — chunks carry source + position metadata.
Strategy choice per corpus — pick the right knife for the job.

6. Cons¶

Boundary loss — even smart splitters sometimes cut bad spots.
Sweet-spot tuning — chunk_size is empirical, not theoretical.
Metadata bloat — for large corpora, metadata can outweigh content storage.
Strategy mismatch: the wrong choice can tank recall (up to 10× worse than the right choice).

7. Trade-offs (per strategy)¶

Strategy	Recall	Indexing Cost	Implementation Complexity	When
Fixed	⛔ Low	✅ Free	✅ Trivial	Synthetic data only
Recursive	✅ Good	✅ Free	✅ Easy	General default
Structure-aware	✅ Great (if structured)	✅ Free	✅ Easy	Markdown/HTML/code
Semantic	✅ Better	⚠️ Medium ($/doc)	⚠️ Medium	Narrative content
Agentic	✅✅ Best	⛔ High ($/doc)	⚠️ Medium	Small high-value corpora
Hierarchical	✅✅ Best	✅ Free	⚠️ Medium	Technical docs, FAQs
Contextual	✅✅ Best (35% fewer retrieval failures)	⛔ High ($/chunk)	⚠️ Medium	Production-grade RAG
Late chunking	✅✅ Best	⚠️ Medium	⚠️ Hard	Cutting-edge systems

8. Real-world Industry Usage¶

OpenAI (Assistants API)¶

Default: Recursive chunking, ~800 tokens, 400 overlap.
Runs inside their managed file search; most users keep the defaults, though chunk size and overlap can be overridden when files are attached.

Anthropic¶

Published their Contextual Retrieval technique in 2024 (35% fewer retrieval failures).
Claude Projects use a variant with prepended summaries.

Google (Vertex AI Search)¶

Multi-strategy: structure-aware for HTML, recursive for plain text.
Supports per-corpus chunk size tuning.

Enterprise¶

Bloomberg: structure-aware splitting of 10-K filings by section.
JPMorgan: agentic chunking for high-value legal documents.
GitHub Copilot: language-aware code splitting by function/class.
Notion AI: section-based splitting respecting Notion's block model.
Stripe Docs AI: Markdown header-aware for documentation.

Production patterns¶

Multi-stage pipelines: structure splitter → semantic re-split → contextual augmentation.
Caching contextual outputs — same chunk in same doc → reuse generated context.
A/B testing strategies — ship two indexes, compare recall metrics, pick winner.

9. Interview Questions¶

Beginner¶

Why do we chunk? — Embedder limits + retrieval granularity.
Why overlap? — Preserve concepts that straddle a boundary.
What's a good default chunk_size? — 500-1000 chars, 100-200 overlap.

Intermediate¶

Recursive vs Fixed — why is Recursive better? — Respects natural boundaries (paragraph, sentence) before falling back.
When does Semantic chunking beat Recursive? — Narrative content with topic shifts; expensive but better recall.
Difference between Parent-Document and Contextual chunking? Parent-Document indexes small child chunks but returns their larger parents (original text, wider window). Contextual indexes augmented chunks (original text plus LLM-generated context).
How do you decide chunk_size? — Evaluate on a held-out test set; sweep sizes 200, 500, 1000, 2000; pick best recall@k.

Advanced¶

Anthropic's Contextual Retrieval: how does it work and why does it cut retrieval failures by 35%? LLM-generated context is prepended to each chunk before embedding, so chunks are no longer "anonymous"; the embedder sees them as situated in the full doc.
Late chunking — what's novel? — Embed full document first with long-context embedder; pool per-token embeddings into chunk vectors → chunks inherit global context.
Agentic chunking vs Semantic — when use which? — Agentic for small corpora (LLM cost amortizes), Semantic for medium (no per-chunk LLM cost), Recursive for huge.

System design¶

Design a chunking pipeline for 100M heterogeneous documents (PDFs, HTML, code). — Route by detected MIME type to specialized splitter; fixed defaults per type; fallback to recursive; log shrinkage metric (chunks/doc) per source.
Production team reports retrieval recall has dropped from 0.85 to 0.65 over six months. Diagnose. — Likely: corpus drift (new format ingested without proper splitter), index staleness, or query distribution shifted. Audit chunk size distribution, run RAGAS on a baseline test set.

10. Common Mistakes¶

Beginners¶

❌ Using CharacterTextSplitter with its default separator and being surprised chunks exceed chunk_size (it only splits on that one separator, so a long paragraph stays whole).
❌ Zero overlap → critical boundary information lost.
❌ Splitting code with text splitter — function bodies sliced in half.
❌ Same chunk_size for every corpus — different content needs different sizes.

Production teams¶

❌ Not running structure splitter FIRST for Markdown/HTML — losing header metadata.
❌ Using semantic/agentic chunking at indexing time without caching.
❌ No chunk-size evaluation. Picking 1000 by tradition without measurement.
❌ Losing metadata across split — child chunks don't inherit parent ids.

How to avoid¶

Always inherit metadata: splitter.split_documents([doc]) does this automatically.
A/B test chunk sizes early (cheap experiment, high impact).
Cache contextual outputs by (chunk_hash, doc_hash).
Log chunk size distribution per corpus — outliers indicate splitter failure.
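A chunk-size report of the kind suggested in the last point could look like the sketch below. The function name and the choice of statistics are assumptions; only the standard library is used.

```python
import statistics


def chunk_size_report(chunks: list[str], chunk_size: int) -> dict:
    """Summarise chunk lengths and count chunks that exceed the configured size."""
    lengths = [len(c) for c in chunks]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
        "oversized": sum(1 for n in lengths if n > chunk_size),
    }


report = chunk_size_report(["a" * 120, "b" * 480, "c" * 500, "d" * 1900], chunk_size=500)
print(report)
```

A non-zero `oversized` count, or a maximum far above the median, is a hint that some input is slipping past the splitter's separators.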

11. Best Practices¶

Industry standards¶

Recursive as default. Use language/format-aware splitters when applicable.
20% overlap is the safe starting point.
Layered splitting: structure → size for Markdown/HTML.
Token-counting via TokenTextSplitter for tight embedder budgets.

Production¶

Evaluate chunk_size choices with RAGAS on a held-out test set.
Use Parent-Document Retriever for technical docs.
Add Contextual Retrieval if recall is critical and budget allows.
Cache contextual augmentations.
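One way to cache contextual augmentations is to key them on a hash of the chunk and a hash of the document. The sketch below is an in-memory illustration: the class name is hypothetical, and `fake_llm_context` is a placeholder for the real LLM call.

```python
import hashlib


def _sha(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class ContextCache:
    """Memoise generated context by (chunk_hash, doc_hash)."""

    def __init__(self, generate):
        self._generate = generate  # callable(chunk_text, document_text) -> str
        self._store: dict[tuple[str, str], str] = {}
        self.misses = 0

    def get(self, chunk_text: str, document_text: str) -> str:
        key = (_sha(chunk_text), _sha(document_text))
        if key not in self._store:
            self.misses += 1  # only here would the LLM actually be called
            self._store[key] = self._generate(chunk_text, document_text)
        return self._store[key]


def fake_llm_context(chunk_text: str, document_text: str) -> str:
    # Placeholder for an LLM call; returns a deterministic string.
    return f"From a document of {len(document_text)} characters."


cache = ContextCache(fake_llm_context)
first = cache.get("ATP synthesis happens in...", "full document text")
second = cache.get("ATP synthesis happens in...", "full document text")
print(first == second, cache.misses)  # True 1
```

For re-indexing runs the dictionary would be swapped for a persistent store, but the key design stays the same.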

Optimization¶

For huge corpora: bulk-process in parallel (e.g., ThreadPoolExecutor over loader output; splitting itself is CPU-bound in Python, so a ProcessPoolExecutor may scale better for that step).
For semantic chunking: lazy-evaluate — only run on documents where recall is critical.
Maintain a golden test set of (query, expected chunk) pairs; re-run after every splitter change.

12. Evolution Story¶

flowchart LR
    A[Whole-doc embedding] --> B[Fixed-size chunking]
    B --> C[Recursive chunking]
    C --> D[Structure-aware Markdown/HTML/code]
    D --> E[Semantic chunking embeddings]
    E --> F[Agentic chunking LLM]
    F --> G[Hierarchical parent-child]
    G --> H[Contextual chunking 35% fewer retrieval failures]
    H --> I[Late chunking long-context embedder]

Where we are: Chunking is a rich design space. Pick by corpus type, budget, and quality requirements.

Where we're going (next chapter): Now each chunk needs to become a vector. We need an Embedding model that captures meaning into a fixed-length vector — and we need to understand the math of similarity (cosine, dot product, Euclidean, Manhattan, Hamming, Jaccard) to know which embeddings are "close" to a query.

Practice¶

What does this print?

Expected: True

chunk_size, chunk_overlap = 1000, 200
is_typical = 0.1 * chunk_size <= chunk_overlap <= 0.25 * chunk_size
print(is_typical)

Use RecursiveCharacterTextSplitter (not CharacterTextSplitter) to enforce chunk_size

Expected: True

splitter = "CharacterTextSplitter"
is_safe = splitter == "RecursiveCharacterTextSplitter"
print(not is_safe)

Quiz — Quick check¶

What you remember

Q1. Why use chunk_overlap?

Ensures concepts spanning a boundary appear in both adjacent chunks
Saves memory
Speeds up embedding
Reduces hallucination directly

Q2. Which strategy uses an LLM to GENERATE chunk context (per Anthropic's research)?

Recursive
Semantic
Contextual chunking
Fixed

Q3. When is Parent-Document Retriever the right pick?

When small chunks match precisely but you need surrounding context
When you have only one document
When you need streaming
Always — it's the default

Common doubts¶

Which strategy should I start with?

Recursive (the default). It works for 90% of content. Measure recall@k. Only switch to semantic/contextual/agentic when the metric demands it.

How do I evaluate chunking quality?

Build a small test set of (query, expected_doc_id) pairs. Run retrieval with each candidate chunking strategy. Measure recall@5. Cheapest, most impactful experiment in your RAG pipeline.
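The metric itself takes only a few lines. The sketch below assumes one expected document id per query and a dictionary of ranked results per query; the names and the toy data are illustrative only.

```python
def recall_at_k(results: dict[str, list[str]], expected: dict[str, str], k: int = 5) -> float:
    """Fraction of queries whose expected doc id appears in the top-k results."""
    hits = sum(1 for query, doc_id in expected.items()
               if doc_id in results.get(query, [])[:k])
    return hits / len(expected)


expected = {"q1": "d1", "q2": "d2"}
results = {"q1": ["d3", "d1"], "q2": ["d9", "d8"]}  # ranked ids returned by a retriever
score = recall_at_k(results, expected, k=5)
print(score)  # 0.5
```

Running the same test set against each candidate strategy and comparing the scores gives a like-for-like measurement.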

Will Contextual Retrieval replace everything?

Maybe not — it's expensive. But it's the strongest known recall booster for non-trivial corpora. Use it where stakes justify the cost.
