
Text Splitters & Chunking Strategies

1. Why does this topic exist?

We have clean Documents from the loader. Why not just embed them whole?

Three hard limits forbid it:

  1. Embedding models cap input length — typically 8K tokens (OpenAI text-embedding-3-small), 32K at the largest. A 200-page PDF is ~80K tokens. It physically won't fit.

  2. One vector can't represent 50 pages. Embedding compresses meaning into ~1500 numbers. If you cram 50 pages' worth of meaning in there, every concept blurs into an average. Retrieval becomes "give me the doc most generally similar", which is useless for specific questions.

  3. Retrieval granularity = answer quality. When user asks about a specific topic, you want the specific paragraph, not the whole book. Without chunking, you can only retrieve "books that mention X" — not "the paragraph about X".

Industry pain example: A startup embedded entire research papers as single vectors. Every query returned the same 3 papers (the broadest ones, vaguely relevant to everything). After chunking into 800-char pieces, the same queries returned specific equations, specific figures, specific paragraphs. Recall@5 went from 0.3 → 0.91 overnight.

Why the old approach failed:

| Old approach | Problem |
| --- | --- |
| One vector per document | Loses granularity |
| Naive split every 1000 chars | Cuts mid-sentence, mid-equation |
| Manual section detection | Bespoke per format, brittle |

Chunking solves the granularity problem; smart chunking solves the boundary problem (don't cut in the middle of meaning).


2. What is it?

Simple explanation

A text splitter is a knife that knows where to cut. It breaks long documents into smaller, overlapping pieces — each small enough to embed, large enough to carry meaning.

Technical explanation

A Text Splitter is a class that takes a Document (or list) and returns a list of smaller Documents ("chunks"), governed by two parameters:

  • chunk_size: max characters/tokens per chunk
  • chunk_overlap: characters/tokens shared between consecutive chunks

flowchart LR
    A[Document<br/>10000 tokens] --> B[Splitter<br/>size=500 overlap=100]
    B --> C[Chunk 1: tokens 0-500]
    B --> D[Chunk 2: tokens 400-900]
    B --> E[Chunk 3: tokens 800-1300]
    B --> F[...]

The chunks preserve the parent document's metadata and add their own (chunk index, etc.).
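
To make the two parameters concrete, here is a minimal pure-Python sketch of the window arithmetic shown in the diagram. The function name `sliding_windows` is made up for this illustration and is not a LangChain API; real splitters also look for natural boundaries instead of cutting at fixed offsets.

```python
def sliding_windows(length: int, chunk_size: int, chunk_overlap: int) -> list[tuple[int, int]]:
    """Return (start, end) offsets of fixed-size windows with overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    windows = []
    start = 0
    while start < length:
        end = min(start + chunk_size, length)
        windows.append((start, end))
        if end == length:  # the document is fully covered
            break
        start += step
    return windows


windows = sliding_windows(length=10_000, chunk_size=500, chunk_overlap=100)
print(windows[:3])   # [(0, 500), (400, 900), (800, 1300)]
print(len(windows))  # 25
```

Each window advances by chunk_size minus chunk_overlap, which is why a larger overlap means more chunks (and more embedding calls) for the same document.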

Industry definition

The term "chunking" pre-dates RAG: it comes from information retrieval, going back to the 1970s. In the RAG era (2023+), eight strategies have emerged, each solving a different failure mode.

Mental model

Imagine slicing a long article into note cards. You want each card:

  • Small enough to study quickly.
  • Large enough to be self-contained (a single sentence isn't a card).
  • Overlapping so an idea straddling two cards lives on both.
  • Boundary-aware — never cut in the middle of a sentence or table.

That's the chunking job.

Analogy

A skilled chef cutting vegetables: uniform sizes (for even cooking), at natural joints (not through the center of a carrot), with enough overlap (small extras) to ensure nothing's wasted.


3. How does it work?

The two knobs every splitter has

| Knob | What it controls | Typical |
| --- | --- | --- |
| chunk_size | Max chars/tokens per chunk | 500-1500 |
| chunk_overlap | Shared chars between adjacent chunks | 10-20% of chunk_size |

Why overlap matters

flowchart LR
    A[Chunk A: ...the mitochondria is the powerhouse of] --> B[Chunk B: powerhouse of the cell. ATP synthesis...]

Without overlap, "powerhouse of the cell" gets split at "of" → neither chunk has the full phrase → retrieval misses queries about "powerhouse of the cell."
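
The effect is easy to reproduce with a toy fixed-size splitter. This is a sketch under simple assumptions: `split_fixed` is a hypothetical helper, and a substring check stands in for retrieval.

```python
def split_fixed(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut text every (chunk_size - chunk_overlap) characters."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


text = "The mitochondria is the powerhouse of the cell. ATP synthesis happens there."
phrase = "powerhouse of the cell"

no_overlap = split_fixed(text, chunk_size=40, chunk_overlap=0)
with_overlap = split_fixed(text, chunk_size=40, chunk_overlap=20)

# Without overlap the phrase is cut in two; with overlap one chunk holds it whole.
print(any(phrase in chunk for chunk in no_overlap))    # False
print(any(phrase in chunk for chunk in with_overlap))  # True
```

The overlap here is 50% only to keep the example tiny; it is not a recommended setting.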

Token-vs-character

Most splitters count characters by default. For exact embedder-budget control, use TokenTextSplitter:

from langchain_text_splitters import TokenTextSplitter
splitter = TokenTextSplitter(chunk_size=300, chunk_overlap=50)

Rule of thumb: ~4 chars ≈ 1 token (English).
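
When a tokenizer is not at hand, the rule of thumb can be turned into a quick budget check. This is only an estimate: the 4-characters-per-token ratio and the 20% safety margin are assumptions, and the helper names are made up for this sketch.

```python
import math

CHARS_PER_TOKEN = 4  # rough average for English; varies by tokenizer and language


def estimate_tokens(text: str) -> int:
    """Approximate token count from character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_budget(text: str, max_tokens: int, safety_margin: float = 0.2) -> bool:
    """True if the estimate stays under the budget with some headroom."""
    return estimate_tokens(text) <= max_tokens * (1 - safety_margin)


chunk = "x" * 1000
print(estimate_tokens(chunk))               # 250
print(fits_budget(chunk, max_tokens=300))   # False: 250 exceeds 80% of 300
print(fits_budget(chunk, max_tokens=8191))  # True
```

For hard limits, count with the embedder's real tokenizer; the estimate is for sizing experiments, not for enforcement.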


The 8 chunking strategies

We'll cover eight distinct strategies, from simplest to most sophisticated. Each solves a specific failure of the previous.

Strategy 1: Fixed-size chunking

The idea: Split every N characters. Period.

from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(
    separator="",        # no special separator
    chunk_size=500,
    chunk_overlap=100,
)
chunks = splitter.split_documents(docs)

Diagram:

flowchart LR
    A[Text: The quick brown fox jumps over the lazy dog. Mitochondria is...] --> B[Chunk 0-500]
    A --> C[Chunk 400-900]
    A --> D[Chunk 800-1300]

Pros:

  • Trivial implementation.
  • Predictable size = predictable cost.

Cons:

  • Cuts mid-sentence, mid-word, mid-equation.
  • Destroys semantic boundaries.

When to use: never, except for synthetic/uniform data (logs, structured records).

Failure case: a cut that lands inside "the temperature was 98.6°F" (for example, right after the decimal point) yields "the temperature was 98." and "6°F". Neither fragment carries the measurement, so the embedder sees noise.


Strategy 2: Recursive chunking — the default

The idea: Try to split on natural separators in priority order — paragraph → line → sentence → word → character. Falls back to the next as needed.

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_documents(docs)

Diagram:

flowchart TD
    A[Long text] --> B{Fits in chunk_size}
    B -->|yes| KEEP[Keep as one chunk]
    B -->|no| C[Split on paragraph \\n\\n]
    C --> D{All pieces fit}
    D -->|yes| OUT1[Done]
    D -->|no| E[Split bigger pieces on line \\n]
    E --> F{All fit}
    F -->|yes| OUT2[Done]
    F -->|no| G[Split on sentence . ]
    G --> H[... recurse to word, then char]

Pros:

  • Respects natural boundaries when possible.
  • Falls back gracefully.
  • The right default for general prose.

Cons:

  • Doesn't understand semantics (just looks at characters).
  • May still split related ideas.

When to use: 90% of use cases. Articles, manuals, docs, books.

Production tip: Always use this for first-pass general text. Customize separators only for special formats.
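
The fallback logic in the diagram can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's implementation: it has no overlap, it drops the separator at cut points, and `recursive_split` is a name invented for this sketch.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Simplified recursive splitting (no overlap), for illustration only."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # keep merging small pieces
            continue
        if current:
            chunks.append(current)       # flush what has been merged so far
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too big: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


text = (
    "First paragraph is short.\n\n"
    "Second paragraph is a bit longer. It has two sentences.\n\n"
    "Third."
)
for chunk in recursive_split(text, 40, ["\n\n", "\n", ". ", " ", ""]):
    print(repr(chunk))
```

Only the second paragraph exceeds 40 characters, so only it is re-split on the sentence separator; the other paragraphs stay whole.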


Strategy 3: Document-structure chunking (Markdown / HTML / Code)

The idea: Use the document's structural markers (headings, tags, function defs) as natural boundaries.

Markdown-aware

from langchain_text_splitters import MarkdownHeaderTextSplitter

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "h1"),
        ("##", "h2"),
        ("###", "h3"),
    ],
)
chunks = splitter.split_text(markdown_text)
# Each chunk's metadata records the header path

Each chunk knows what section it came from. Pair with RecursiveCharacterTextSplitter if sections are still too big.

flowchart LR
    A[Markdown text] --> B[MarkdownHeaderTextSplitter]
    B --> C[Section chunks with header metadata]
    C --> D[Recursive splitter for size]
    D --> E[Final chunks with structure preserved]
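
To show what "header metadata" means in practice, here is a rough standard-library sketch of header-path tracking. It is not how MarkdownHeaderTextSplitter is implemented; `split_by_headers` is a hypothetical helper, and it ignores edge cases such as `#` lines inside fenced code.

```python
import re

HEADER_RE = re.compile(r"^(#{1,3})\s+(.*)$")  # matches #, ## and ### headings


def split_by_headers(markdown: str) -> list[dict]:
    """Group lines under their nearest heading and record the heading path."""
    sections: list[dict] = []
    path: dict[int, str] = {}  # heading level -> current title
    body: list[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            metadata = {f"h{level}": title for level, title in sorted(path.items())}
            sections.append({"text": text, "metadata": metadata})
        body.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()  # close the previous section under the old path
            level = len(match.group(1))
            # A new heading invalidates headings at the same or deeper level.
            for stale in [k for k in path if k >= level]:
                del path[stale]
            path[level] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return sections


doc = "# Guide\nIntro text.\n## Install\nRun pip.\n## Usage\n### CLI\nCall the tool.\n"
for section in split_by_headers(doc):
    print(section["metadata"], "->", section["text"])
```

The last section carries the full path (h1 Guide, h2 Usage, h3 CLI), which is the information a citation such as "section X → subsection Y" needs.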

Code-aware

from langchain_text_splitters import RecursiveCharacterTextSplitter, Language

py_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=500, chunk_overlap=50,
)

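The language-aware separators prefer to split at class and def boundaries, so a function that fits within chunk_size is kept whole. A function longer than chunk_size is still split further, first on blank lines and then on single lines.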

HTML-aware

from langchain_text_splitters import HTMLHeaderTextSplitter

splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=[("h1", "h1"), ("h2", "h2")],
)

Pros:

  • Preserves document structure in metadata.
  • Critical for citations ("found in section X → subsection Y").

Cons:

  • Only works for structured formats.


Strategy 4: Semantic chunking

The idea: Use embeddings to detect where topic shifts happen, split there.

flowchart LR
    A[Long text] --> B[Split into sentences]
    B --> C[Embed each sentence]
    C --> D[Compute similarity between adjacent sentences]
    D --> E{Similarity drop > threshold}
    E -->|yes| F[Cut here - topic shift]
    E -->|no| G[Keep together]
    F --> H[Semantic chunks]
    G --> H

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",  # or "standard_deviation"
    breakpoint_threshold_amount=95,
)
chunks = splitter.create_documents([long_text])

Math behind it:

For consecutive sentences \(s_i, s_{i+1}\), compute their embeddings \(\vec{e_i}, \vec{e_{i+1}}\). Distance: \(d_i = 1 - \cos(\vec{e_i}, \vec{e_{i+1}})\).

Cut where \(d_i\) spikes above the 95th percentile of all \(d_i\) in the document — that's a "topic shift".
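
The distance-and-percentile rule can be checked by hand on toy vectors. This sketch assumes one embedding per sentence and uses made-up 2-D vectors instead of a real embedding model; it illustrates the formula above, not the internals of SemanticChunker.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """d = 1 - cos(a, b); assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values: list[float], pct: float) -> float:
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def find_breakpoints(embeddings: list[list[float]], pct: float = 95) -> list[int]:
    """Return each index i where a cut belongs between sentence i and i + 1."""
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    threshold = percentile(distances, pct)
    return [i for i, d in enumerate(distances) if d > threshold]


# Toy "embeddings": three sentences on one topic, then two on another.
toy = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
print(find_breakpoints(toy))  # [2]
```

The only large distance is between the third and fourth sentence, so that is where the single cut lands.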

Pros:

  • Chunks are semantically coherent.
  • No size mismatch with topics.
  • Empirically improves retrieval recall by 5-15% over recursive.

Cons:

  • Expensive: one embedding call per sentence at ingest time.
  • Chunk sizes vary wildly: some 100 chars, some 3000.
  • Threshold tuning is empirical.

When to use: valuable for narrative documents (articles, books) where topics shift naturally. Less useful for densely structured docs (reference manuals).


Strategy 5: Agentic chunking

The idea: Use an LLM to read the document and decide where to chunk.

flowchart LR
    A[Document] --> B[LLM: identify topic boundaries]
    B --> C[LLM-proposed chunks]
    C --> D[Validate sizes - re-split if needed]
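    D --> E[Final chunks with reasoning]

import json

def agentic_chunk(text: str, llm) -> list[str]:
    """Ask an LLM to split text into self-contained, single-topic chunks."""
    prompt = f"""
Split the following document into chunks where each chunk is
self-contained and covers ONE topic. Return ONLY a JSON list of strings.

Document:
{text}
"""
    response = llm.invoke(prompt)
    # Assumes the reply is bare JSON; add cleanup if your model wraps it in prose.
    return json.loads(response.content)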

Pros:

  • LLM understands semantics beyond what embeddings capture.
  • Can preserve cross-references ("see Figure 3" stays with Figure 3).

Cons:

  • Expensive: one LLM call per document; for 10K docs, costs add up.
  • Slow indexing.
  • Non-deterministic (different runs → different chunks).

When to use: small, high-value corpora (legal contracts, medical guidelines) where quality justifies cost. Not for bulk web/ticket data.
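
The "validate sizes, re-split if needed" step from the diagram might look like the following. Both helpers are hypothetical: `parse_llm_chunks` assumes the reply contains exactly one JSON list, and the hard cut in `enforce_max_size` is a placeholder for whatever fallback splitter the pipeline already uses.

```python
import json


def parse_llm_chunks(raw: str) -> list[str]:
    """Extract a JSON list of strings from an LLM reply that may include extra prose."""
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON list found in the reply")
    chunks = json.loads(raw[start:end + 1])
    if not all(isinstance(chunk, str) for chunk in chunks):
        raise ValueError("expected a JSON list of strings")
    return chunks


def enforce_max_size(chunks: list[str], max_chars: int) -> list[str]:
    """Re-split any chunk the LLM left too large (hard cut as a simple fallback)."""
    validated: list[str] = []
    for chunk in chunks:
        if len(chunk) <= max_chars:
            validated.append(chunk)
        else:
            validated.extend(chunk[i:i + max_chars] for i in range(0, len(chunk), max_chars))
    return validated


raw_reply = 'Here are the chunks:\n["Intro to cells.", "' + "A" * 25 + '"]'
chunks = enforce_max_size(parse_llm_chunks(raw_reply), max_chars=20)
print([len(chunk) for chunk in chunks])  # [15, 20, 5]
```

Validating the output matters because the LLM is free to ignore size instructions, and an oversized chunk would fail later at embedding time.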


Strategy 6: Hierarchical / Parent-Document chunking

The idea: Index small chunks (precise matching), retrieve their large parent chunks (rich context).

flowchart LR
    DOC[Original Document] --> P[Parent chunks 2000 chars]
    P --> C[Child chunks 200 chars]
    C --> VS[Vector store with small chunks]
    Q[Query] --> VS
    VS --> MATCH[Match child IDs]
    MATCH --> RET[Return PARENT chunks for context]

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Parents carry the context returned to the LLM; children are what gets embedded.
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter  = RecursiveCharacterTextSplitter(chunk_size=200)

retriever = ParentDocumentRetriever(
    vectorstore=vector_store,      # any LangChain vector store created beforehand
    docstore=InMemoryStore(),      # holds the parent chunks, keyed by id
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retriever.add_documents(docs)

Pros:

  • Best of both worlds: precise matching, rich context.
  • The pattern most production systems converge on.

Cons:

  • Two data structures (vector store + doc store) to keep in sync.
  • More implementation complexity.

When to use: technical docs, FAQs, anywhere small queries should retrieve big context.
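
The bookkeeping behind the pattern is small enough to sketch without any library. In this toy version a substring match stands in for vector search, sizes are tiny, and all function names are invented for the illustration.

```python
def split_fixed(text: str, size: int) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]


def build_index(document: str, parent_size: int, child_size: int):
    """Return (parents, children); each child remembers the id of its parent."""
    parents = dict(enumerate(split_fixed(document, parent_size)))
    children = [
        {"text": child, "parent_id": parent_id}
        for parent_id, parent in parents.items()
        for child in split_fixed(parent, child_size)
    ]
    return parents, children


def retrieve_parents(query: str, parents: dict[int, str], children: list[dict]) -> list[str]:
    """Match on small children, return the de-duplicated parents they belong to."""
    # Substring match is a stand-in for similarity search over child embeddings.
    hit_ids = {c["parent_id"] for c in children if query.lower() in c["text"].lower()}
    return [parents[parent_id] for parent_id in sorted(hit_ids)]


parent_a = "Glycolysis happens in the cytoplasm.".ljust(40)
parent_b = "ATP synthesis happens in mitochondria.".ljust(40)
parents, children = build_index(parent_a + parent_b, parent_size=40, child_size=10)

results = retrieve_parents("ATP", parents, children)
print(results)  # the full second parent, not just the 10-character child
```

The matching child is only "ATP synthe"; returning its parent is what gives the LLM the rest of the sentence.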


Strategy 7: Contextual chunking (Anthropic's "Contextual Retrieval")

The idea: Before embedding each chunk, prepend an LLM-generated context summary of where it sits in the document.

flowchart LR
    A[Original chunk: ATP synthesis happens in...] --> B[LLM: explain context]
    B --> C[Context: This chunk is from the Cellular Respiration chapter, section Mitochondria]
    C --> D[Augmented chunk: CONTEXT + original text]
    D --> E[Embed augmented chunk]

def contextual_chunk(chunk_text: str, document_text: str, llm):
    prompt = f"""
<document>
{document_text}
</document>
Here is a chunk:
<chunk>{chunk_text}</chunk>
Provide a short 1-2 sentence context for situating this chunk
within the document. Output ONLY the context.
"""
    context = llm.invoke(prompt).content
    return f"Context: {context}\n\n{chunk_text}"

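From Anthropic's September 2024 Contextual Retrieval post: contextual embeddings reduced the top-20 retrieval failure rate by 35% (5.7% → 3.7%) compared with embedding the chunks as-is; adding contextual BM25 brought the reduction to 49%.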

Pros:

  • Each chunk knows its place in the larger document.
  • Dramatically improves recall for ambiguous chunks (e.g., a code snippet without context).

Cons:

  • Expensive: one LLM call per chunk. Use gpt-4o-mini to keep costs sane.
  • Caching helps (same chunk + same doc → same context).

When to use: medium-sized, high-value corpora. Anthropic's research suggests it's worth the cost for production RAG.
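
Since the same chunk in the same document always needs the same context, the LLM call can be memoized. The sketch below is one possible shape for such a cache; `ContextCache` and `fake_generate` are hypothetical names, and the fake generator only exists so the example runs offline.

```python
import hashlib
from typing import Callable


def _digest(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class ContextCache:
    """Memoize generated context by (chunk hash, document hash)."""

    def __init__(self, generate: Callable[[str, str], str]) -> None:
        self._generate = generate  # in production: a wrapper around the LLM call
        self._store: dict[tuple[str, str], str] = {}
        self.misses = 0            # counts how often the generator really ran

    def get(self, chunk_text: str, document_text: str) -> str:
        key = (_digest(chunk_text), _digest(document_text))
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._generate(chunk_text, document_text)
        return self._store[key]


def fake_generate(chunk_text: str, document_text: str) -> str:
    """Stand-in for the LLM; returns a deterministic placeholder context."""
    return f"From a document of {len(document_text)} characters."


cache = ContextCache(fake_generate)
doc = "Cellular respiration overview. ATP synthesis happens in the mitochondria."
chunk = "ATP synthesis happens in the mitochondria."

first = cache.get(chunk, doc)
second = cache.get(chunk, doc)  # served from the cache, no second "LLM call"
print(cache.misses)             # 1
```

An in-memory dict is enough for a single indexing run; re-indexing jobs would need the store persisted between runs.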


Strategy 8: Late chunking (advanced)

The idea: Embed the WHOLE document first with a long-context embedder (e.g., jina-embeddings-v3 with 8K context). THEN chunk the resulting per-token embeddings.

flowchart LR
    A[Full document up to 8K tokens] --> B[Long-context embedder]
    B --> C[Per-token embeddings retain global context]
    C --> D[Pool over chunk windows]
    D --> E[Chunk embeddings with full doc context]

The key innovation: each chunk's embedding already saw the whole document. So a chunk that mentions "the protocol" knows which protocol from the rest of the doc.

Pros:

  • No need for contextual prepending (the embedding already has context).
  • One embedding pass per document instead of one per chunk.

Cons:

  • Requires a long-context embedding model that exposes per-token embeddings (e.g., Jina v3).
  • Documents longer than the model's context window still have to be split before embedding.

When to use: cutting-edge production systems where retrieval quality is critical and you can use Jina v3 or similar.
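
The pooling step is plain arithmetic once the per-token embeddings exist. This sketch assumes mean pooling and uses hand-written 2-D vectors in place of real model output; `late_chunk` is a name chosen for the illustration.

```python
def mean_pool(vectors: list[list[float]]) -> list[float]:
    """Average a list of equal-length vectors, dimension by dimension."""
    dims = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]


def late_chunk(
    token_embeddings: list[list[float]], spans: list[tuple[int, int]]
) -> list[list[float]]:
    """Pool contextualized token embeddings over each (start, end) token span."""
    return [mean_pool(token_embeddings[start:end]) for start, end in spans]


# Pretend these 4 token vectors came from ONE forward pass over the whole document.
token_embeddings = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0]]
chunk_vectors = late_chunk(token_embeddings, spans=[(0, 2), (2, 4)])
print(chunk_vectors)  # [[2.0, 1.0], [1.0, 3.0]]
```

The difference from ordinary chunking is only the order of operations: the chunk boundaries are applied after the model has already attended over the full text.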


4. Visual Learning

Architecture — chunking strategies side by side

flowchart TD
    DOC[Document]
    DOC --> S1[Fixed]
    DOC --> S2[Recursive - default]
    DOC --> S3[Structure-aware Markdown/HTML/code]
    DOC --> S4[Semantic uses embeddings]
    DOC --> S5[Agentic uses LLM]
    DOC --> S6[Hierarchical parent+child]
    DOC --> S7[Contextual LLM-augmented]
    DOC --> S8[Late chunking long-context embedder]

Workflow — choosing the right strategy

flowchart TD
    A[New corpus] --> B{Format}
    B -->|Markdown| MK[MarkdownHeader + Recursive]
    B -->|Code| CD[Language-aware Recursive]
    B -->|HTML| HT[HTMLHeader + Recursive]
    B -->|General prose| C{Corpus size}
    C -->|Tiny < 1000 chunks| AG[Agentic - quality first]
    C -->|Medium 1K-100K| SE[Semantic or Contextual]
    C -->|Large > 100K| RC[Recursive - safe default]
    C -->|Quality critical| PD[Parent Document]

Sequence — chunking + indexing pipeline

sequenceDiagram
    participant App
    participant Loader
    participant Splitter
    participant Embedder
    participant VS as Vector Store
    App->>Loader: load docs
    Loader-->>App: List[Document]
    App->>Splitter: split_documents
    Splitter-->>App: List[Chunk]
    loop Per chunk
        App->>Embedder: embed
        Embedder-->>App: vector
        App->>VS: insert(vector, chunk, metadata)
    end

flowchart LR
    A[Contract PDF] --> B[UnstructuredPDFLoader]
    B --> C[Markdown-style structure detected]
    C --> D[MarkdownHeaderTextSplitter by clause]
    D --> E[Contextual chunking with LLM]
    E --> F[Each chunk knows: client, contract type, clause number, section]
    F --> G[Vector store with rich metadata]

Now queries like "show me indemnity clauses from NDAs signed in 2024" work via metadata + semantic retrieval.


5. Pros (chunking in general)

  • Granular retrieval — pinpoint paragraph-level results.
  • Fits embedder budgets — every chunk under the limit.
  • Cost-efficient indexing — embedding is cheap per chunk.
  • Citation-friendly — chunks carry source + position metadata.
  • Strategy choice per corpus — pick the right knife for the job.

6. Cons

  • Boundary loss — even smart splitters sometimes cut bad spots.
  • Sweet-spot tuning — chunk_size is empirical, not theoretical.
  • Metadata bloat — for large corpora, metadata can outweigh content storage.
  • Strategy mismatch — wrong choice tanks recall (10× worse than right choice).

7. Trade-offs (per strategy)

| Strategy | Recall | Indexing Cost | Implementation Complexity | When |
| --- | --- | --- | --- | --- |
| Fixed | ⛔ Low | ✅ Free | ✅ Trivial | Synthetic data only |
| Recursive | ✅ Good | ✅ Free | ✅ Easy | General default |
| Structure-aware | ✅ Great (if structured) | ✅ Free | ✅ Easy | Markdown/HTML/code |
| Semantic | ✅ Better | ⚠️ Medium ($/doc) | ⚠️ Medium | Narrative content |
| Agentic | ✅✅ Best | ⛔ High ($/doc) | ⚠️ Medium | Small high-value corpora |
| Hierarchical | ✅✅ Best | ✅ Free | ⚠️ Medium | Technical docs, FAQs |
| Contextual | ✅✅ Best (35% fewer retrieval failures) | ⛔ High ($/chunk) | ⚠️ Medium | Production-grade RAG |
| Late chunking | ✅✅ Best | ⚠️ Medium | ⚠️ Hard | Cutting-edge systems |

8. Real-world Industry Usage

OpenAI (Assistants API)

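  • Default for file search: ~800-token chunks with a 400-token overlap.
  • Hidden behind their managed file search; most users keep the defaults, although chunk size and overlap can be overridden when files are added to a vector store.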

Anthropic

  • Published their Contextual Retrieval technique in 2024 (35% fewer retrieval failures from contextual embeddings alone).
  • Claude Projects use a variant with prepended summaries.
  • Multi-strategy: structure-aware for HTML, recursive for plain text.
  • Supports per-corpus chunk size tuning.

Enterprise

  • Bloomberg: structure-aware splitting of 10-K filings by section.
  • JPMorgan: agentic chunking for high-value legal documents.
  • GitHub Copilot: language-aware code splitting by function/class.
  • Notion AI: section-based splitting respecting Notion's block model.
  • Stripe Docs AI: Markdown header-aware for documentation.

Production patterns

  • Multi-stage pipelines: structure splitter → semantic re-split → contextual augmentation.
  • Caching contextual outputs — same chunk in same doc → reuse generated context.
  • A/B testing strategies — ship two indexes, compare recall metrics, pick winner.

9. Interview Questions

Beginner

  1. Why do we chunk? — Embedder limits + retrieval granularity.
  2. Why overlap? — Preserve concepts that straddle a boundary.
  3. What's a good default chunk_size? — 500-1000 chars, 100-200 overlap.

Intermediate

  1. Recursive vs Fixed — why is Recursive better? — Respects natural boundaries (paragraph, sentence) before falling back.
  2. When does Semantic chunking beat Recursive? — Narrative content with topic shifts; expensive but better recall.
  3. Difference between Parent-Document and Contextual chunking? — Parent-Document indexes small child chunks but returns the larger parent chunks they came from (the text is unchanged; only the unit handed to the LLM grows). Contextual embeds each chunk with LLM-generated context prepended (the indexed text itself changes).
  4. How do you decide chunk_size? — Evaluate on a held-out test set; sweep sizes 200, 500, 1000, 2000; pick best recall@k.

Advanced

  1. Anthropic's Contextual Retrieval: how does it work and why does it cut retrieval failures by 35%? — LLM-generated context is prepended to each chunk before embedding; chunks are no longer "anonymous", so the embedder sees them as situated in the full doc.
  2. Late chunking — what's novel? — Embed full document first with long-context embedder; pool per-token embeddings into chunk vectors → chunks inherit global context.
  3. Agentic chunking vs Semantic — when use which? — Agentic for small corpora (LLM cost amortizes), Semantic for medium (no per-chunk LLM cost), Recursive for huge.

System design

  1. Design a chunking pipeline for 100M heterogeneous documents (PDFs, HTML, code). — Route by detected MIME type to specialized splitter; fixed defaults per type; fallback to recursive; log shrinkage metric (chunks/doc) per source.
  2. Production team reports retrieval recall has dropped from 0.85 to 0.65 over six months. Diagnose. — Likely: corpus drift (new format ingested without proper splitter), index staleness, or query distribution shifted. Audit chunk size distribution, run RAGAS on a baseline test set.
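
For the first system-design question, the routing idea can be sketched as a lookup table with a fallback. Everything here is illustrative: the file extension is a cheap stand-in for real MIME detection, and the `split_on` splitters are placeholders for the format-aware splitters a real pipeline would plug in.

```python
from pathlib import Path
from typing import Callable

Splitter = Callable[[str], list[str]]


def split_on(separator: str) -> Splitter:
    """Build a trivial splitter; real pipelines would use format-aware ones."""
    return lambda text: [part for part in text.split(separator) if part.strip()]


ROUTES: dict[str, Splitter] = {
    ".md": split_on("\n# "),
    ".html": split_on("<h2"),
    ".py": split_on("\ndef "),
}
FALLBACK: Splitter = split_on("\n\n")  # generic prose splitter


def split_routed(filename: str, text: str) -> dict:
    """Pick a splitter by file type and report chunks per document for logging."""
    suffix = Path(filename).suffix.lower()
    splitter = ROUTES.get(suffix, FALLBACK)
    chunks = splitter(text)
    return {
        "source": filename,
        "route": suffix if suffix in ROUTES else "fallback",
        "chunks": chunks,
        "chunks_per_doc": len(chunks),
    }


print(split_routed("guide.md", "# A\ntext\n# B\nmore")["chunks_per_doc"])  # 2
print(split_routed("notes.txt", "a\n\nb\n\nc")["route"])                   # fallback
```

Logging the route and chunks-per-document for every source makes it visible when a new format starts falling through to the fallback.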

10. Common Mistakes

Beginners

  • ❌ Using CharacterTextSplitter and being surprised chunks exceed chunk_size.
  • ❌ Zero overlap → critical boundary information lost.
  • ❌ Splitting code with text splitter — function bodies sliced in half.
  • ❌ Same chunk_size for every corpus — different content needs different sizes.

Production teams

  • ❌ Not running structure splitter FIRST for Markdown/HTML — losing header metadata.
  • ❌ Using semantic/agentic chunking at indexing time without caching.
  • ❌ No chunk-size evaluation. Picking 1000 by tradition without measurement.
  • ❌ Losing metadata across split — child chunks don't inherit parent ids.

How to avoid

  • Always inherit metadata: splitter.split_documents([doc]) does this automatically.
  • A/B test chunk sizes early (cheap experiment, high impact).
  • Cache contextual outputs by (chunk_hash, doc_hash).
  • Log chunk size distribution per corpus — outliers indicate splitter failure.
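
A chunk-size report needs nothing beyond the standard library. The thresholds below (anything over chunk_size is "oversized", anything under 10% of it is "tiny") are assumptions for this sketch, not established cutoffs.

```python
import statistics


def size_report(chunks: list[str], chunk_size: int) -> dict:
    """Summarize chunk lengths and count suspicious outliers."""
    lengths = [len(chunk) for chunk in chunks]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
        "oversized": sum(1 for n in lengths if n > chunk_size),
        "tiny": sum(1 for n in lengths if n < 0.1 * chunk_size),
    }


chunks = ["a" * 480, "b" * 510, "c" * 20, "d" * 1900]
report = size_report(chunks, chunk_size=500)
print(report)
# {'count': 4, 'min': 20, 'median': 495.0, 'max': 1900, 'oversized': 2, 'tiny': 1}
```

A non-zero "oversized" count is the typical symptom of a splitter whose separator never matched, such as CharacterTextSplitter on text without blank lines.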

11. Best Practices

Industry standards

  • Recursive as default. Use language/format-aware splitters when applicable.
  • 20% overlap is the safe starting point.
  • Layered splitting: structure → size for Markdown/HTML.
  • Token-counting via TokenTextSplitter for tight embedder budgets.

Production

  • Evaluate chunk_size choices with RAGAS on a held-out test set.
  • Use Parent-Document Retriever for technical docs.
  • Add Contextual Retrieval if recall is critical and budget allows.
  • Cache contextual augmentations.

Optimization

  • For huge corpora: bulk-process with parallelism (ThreadPoolExecutor over loader output).
  • For semantic chunking: lazy-evaluate — only run on documents where recall is critical.
  • Maintain a golden test set of (query, expected chunk) pairs; re-run after every splitter change.

12. Evolution Story

flowchart LR
    A[Whole-doc embedding] --> B[Fixed-size chunking]
    B --> C[Recursive chunking]
    C --> D[Structure-aware Markdown/HTML/code]
    D --> E[Semantic chunking embeddings]
    E --> F[Agentic chunking LLM]
    F --> G[Hierarchical parent-child]
    G --> H[Contextual chunking, 35% fewer retrieval failures]
    H --> I[Late chunking long-context embedder]

Where we are: Chunking is a rich design space. Pick by corpus type, budget, and quality requirements.

Where we're going (next chapter): Now each chunk needs to become a vector. We need an Embedding model that captures meaning into a fixed-length vector — and we need to understand the math of similarity (cosine, dot product, Euclidean, Manhattan, Hamming, Jaccard) to know which embeddings are "close" to a query.


Practice

What does this print?

Expected: True

chunk_size, chunk_overlap = 1000, 200
is_typical = 0.1 * chunk_size <= chunk_overlap <= 0.25 * chunk_size
print(is_typical)

Use RecursiveCharacterTextSplitter (not CharacterTextSplitter) to enforce chunk_size

Expected: True

splitter = "CharacterTextSplitter"
is_safe = splitter == "RecursiveCharacterTextSplitter"
print(not is_safe)

Quiz — Quick check

What you remember

Q1. Why use chunk_overlap?

  • Ensures concepts spanning a boundary appear in both adjacent chunks
  • Saves memory
  • Speeds up embedding
  • Reduces hallucination directly

Q2. Which strategy uses an LLM to GENERATE chunk context (per Anthropic's research)?

  • Recursive
  • Semantic
  • Contextual chunking
  • Fixed

Q3. When is Parent-Document Retriever the right pick?

  • When small chunks match precisely but you need surrounding context
  • When you have only one document
  • When you need streaming
  • Always — it's the default

Common doubts

Which strategy should I start with?

Recursive (the default). It works for 90% of content. Measure recall@k. Only switch to semantic/contextual/agentic when the metric demands it.

How do I evaluate chunking quality?

Build a small test set of (query, expected_doc_id) pairs. Run retrieval with each candidate chunking strategy. Measure recall@5. Cheapest, most impactful experiment in your RAG pipeline.
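
The measurement itself is a few lines. In this sketch the retrieval results are hard-coded, made-up data; in a real run each inner list would come from querying an index built with the candidate chunking strategy.

```python
def recall_at_k(retrieved: list[list[str]], expected: list[str], k: int = 5) -> float:
    """Fraction of queries whose expected id appears in the top-k results."""
    hits = sum(
        1 for results, target in zip(retrieved, expected) if target in results[:k]
    )
    return hits / len(expected)


# One expected document id per test query.
expected = ["doc-1", "doc-2", "doc-3", "doc-4"]

# Ranked result ids per query, per candidate strategy (illustrative values only).
strategy_results = {
    "recursive-500": [["doc-1", "doc-9"], ["doc-7", "doc-2"], ["doc-3"], ["doc-8"]],
    "recursive-1000": [["doc-1"], ["doc-7"], ["doc-5"], ["doc-4"]],
}

scores = {
    name: recall_at_k(results, expected, k=2)
    for name, results in strategy_results.items()
}
best = max(scores, key=scores.get)
print(scores)  # {'recursive-500': 0.75, 'recursive-1000': 0.5}
print(best)    # recursive-500
```

Keeping the test set fixed while only the chunking changes is what makes the scores comparable across strategies.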

Will Contextual Retrieval replace everything?

Maybe not — it's expensive. But it's the strongest known recall booster for non-trivial corpora. Use it where stakes justify the cost.
