Tuesday, February 25, 2025

The Art of Document Chunking for LLM Applications

Abdellatif Abdelfattah

How you split documents into pieces matters more than you'd think. Get it right, and your AI gives precise, relevant answers. Get it wrong, and users get frustrated by incomplete or tangential responses.

We've built and optimized chunking strategies for over 40 RAG systems processing everything from technical documentation to legal contracts. Here's what actually moves the needle on retrieval quality.

Why Chunking Matters

LLMs can't process entire documents at once. Even models with huge context windows benefit from receiving targeted, relevant excerpts rather than everything at once.

The challenge: split documents into pieces that are:

  • Self-contained: Each chunk makes sense on its own
  • Coherent: Related information stays together
  • Relevant: Focused enough to match specific queries
  • Contextual: Includes enough surrounding information to be understood

These goals often conflict. Finding the right balance is the art.

In one system we built, switching from 500-character to 1,000-character chunks improved answer completeness by 35% but reduced precision slightly. Users overwhelmingly preferred the longer chunks despite the tradeoff.

The Naive Approach: Fixed Size

The simplest strategy: split every N characters, regardless of content.

This works surprisingly often. It's predictable, fast, and easy to implement. But it has obvious problems:

  • Sentences get cut mid-thought
  • Paragraphs get split randomly
  • Related information ends up in different chunks
  • No respect for document structure

We use fixed-size chunking for one system processing standardized forms—where document structure is consistent and splitting randomly works fine. But for most content, you can do better.
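Fixed-size chunking is simple enough to sketch in a few lines. This is an illustrative minimal version, not production code — the function name and default size are our own choices:

```python
def fixed_size_chunks(text: str, size: int = 1000) -> list[str]:
    """Split text every `size` characters, regardless of content."""
    return [text[i:i + size] for i in range(0, len(text), size)]
```

Note that the last chunk is simply whatever remains, so it may be much shorter than the rest.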

Respecting Boundaries

Better chunking respects natural document boundaries:

Sentence Boundaries

Split on complete sentences. Now at least each chunk contains coherent thoughts. The problem: sentences vary wildly in length. Some documents have 100-word sentences; others use fragments.

We tested sentence-based chunking on technical documentation. Average chunk size varied from 200 to 2,500 characters. The inconsistency hurt retrieval—short chunks lacked context, long ones included too much.

Paragraph Boundaries

Split on paragraph breaks. This often maps well to complete ideas. But academic papers have 500-word paragraphs while marketing copy uses one-sentence paragraphs.

Paragraph-based chunking works well for business reports and articles. We see 15-20% better retrieval accuracy compared to fixed-size when paragraph structure is meaningful.
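A basic paragraph-aware chunker splits on blank lines and then packs whole paragraphs into chunks up to a size budget. This is a minimal sketch under our own naming and defaults; an oversized single paragraph simply becomes its own chunk:

```python
def paragraph_chunks(text: str, max_chars: int = 1000) -> list[str]:
    """Split on blank lines, then pack whole paragraphs into chunks
    no larger than max_chars. A single paragraph longer than
    max_chars becomes its own (oversized) chunk."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

The key property: paragraph boundaries are never crossed mid-paragraph, so each chunk contains only complete ideas.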

Section Boundaries

For structured documents, splitting on sections (chapters, headings) can work beautifully. But not all documents have clear sections, and some sections are way too large.

One knowledge base we optimized had sections ranging from 100 to 10,000 words. We had to implement hierarchical chunking—splitting large sections further while keeping small sections intact.
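The hierarchical fallback described above can be sketched as: keep small sections intact, and split oversized sections at paragraph breaks. Everything here (names, the 2,000-character threshold, the section dict) is illustrative, not the system we actually shipped:

```python
def hierarchical_chunks(
    sections: dict[str, str], max_chars: int = 2000
) -> list[tuple[str, str]]:
    """Return (section_title, chunk_text) pairs. Sections within the
    budget stay whole; larger ones are split at paragraph breaks."""
    out = []
    for title, body in sections.items():
        if len(body) <= max_chars:
            out.append((title, body))
            continue
        piece = ""
        for para in body.split("\n\n"):
            if piece and len(piece) + len(para) + 2 > max_chars:
                out.append((title, piece))
                piece = para
            else:
                piece = f"{piece}\n\n{para}" if piece else para
        if piece:
            out.append((title, piece))
    return out
```

Keeping the section title attached to every sub-chunk matters: it preserves context that would otherwise be lost by the split.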

The Overlap Trick

Here's a simple technique that dramatically improves retrieval: let chunks overlap.

If chunks are 1,000 characters, have each chunk include the last 200 characters of the previous one. This prevents information from getting lost at boundaries.

The overlap costs storage and processing but pays for itself in better retrieval. When a key fact appears near a chunk boundary, it's now in two chunks instead of awkwardly split.

We A/B tested overlap percentages from 0% to 30%. The sweet spot for most content is 15-20% overlap (150-200 characters for 1,000-character chunks). Beyond 20%, we saw diminishing returns—storage costs increased faster than retrieval improved.
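The overlap technique is a one-line change to the fixed-size splitter: advance by the chunk size minus the overlap instead of the full chunk size. A minimal sketch, with our own names and defaults:

```python
def overlapping_chunks(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Fixed-size chunks where each chunk begins `overlap` characters
    before the previous chunk ends."""
    step = size - overlap
    chunks = []
    for i in range(0, len(text), step):
        chunks.append(text[i:i + size])
        if i + size >= len(text):  # last chunk reached the end; stop
            break
    return chunks
```

Any fact within `overlap` characters of a boundary now appears whole in at least one chunk.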

Semantic Chunking

The most sophisticated approach: use AI to understand content relationships and split where topics change.

This creates chunks where information naturally clusters. It handles documents with irregular structure. It adapts to the content rather than imposing arbitrary rules.

The downsides: it's computationally expensive, requires API calls for embeddings, and can be unpredictable. Use it when quality matters more than speed or cost.

We implemented semantic chunking for a legal document system where precision was critical. Processing costs increased 8x, but retrieval accuracy improved 28%. For that use case, the tradeoff was worth it. For high-volume customer support docs, it wasn't.
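The core idea of semantic chunking is: embed adjacent sentences, and start a new chunk wherever similarity drops, signaling a topic shift. The sketch below uses a toy bag-of-words "embedding" purely so it runs standalone — a real system would call an embedding model at that point — and the 0.2 threshold is arbitrary:

```python
import math
import re
from collections import Counter

def embed(sentence: str) -> Counter:
    """Stand-in embedding: word counts. A real system would call an
    embedding model here instead."""
    return Counter(re.findall(r"\w+", sentence.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[str]:
    """Start a new chunk wherever similarity between adjacent
    sentences falls below the threshold (a likely topic shift)."""
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(sent)) < threshold:
            chunks.append(" ".join(current))
            current = [sent]
        else:
            current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```

The structure is the same with real embeddings; only `embed` changes — which is exactly where the API-call cost comes from.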

Metadata Is Your Secret Weapon

Every chunk should carry metadata about where it came from:

  • Source document name and ID
  • Section or chapter title
  • Page number
  • Creation date
  • Document type

This metadata enables powerful filtering. "Find information about pricing in the 2024 sales guide" becomes straightforward when chunks know which document and year they're from.

Metadata also improves citations. Instead of "according to the documents," you can say "according to the Q3 Financial Report, page 15."

In one system, adding section metadata to chunks improved user satisfaction scores from 3.2 to 4.1 out of 5. Users trusted answers more when they could see exactly where information came from.

The Hierarchical Approach

Different queries need different levels of detail. Sometimes users want a quick overview; other times they need specific details.

We built a hierarchical system for a client with 10,000+ technical documents:

  • Summaries: High-level document overviews (200-300 words)
  • Sections: Mid-level chunks covering major topics (500-800 words)
  • Paragraphs: Detailed chunks with specific information (150-300 words)

The system retrieves at the appropriate level based on query type. Broad questions like "What does this product do?" get summaries. Specific questions like "What's the API rate limit?" get detailed chunks.

User queries requiring multiple passes dropped from 40% to 15%. People found what they needed on the first try more often.
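Routing a query to the right level can start as simple heuristics before graduating to a classifier. This toy router is purely illustrative — the keyword lists and rules are made up for the sketch, not what any production system uses:

```python
def route_level(query: str) -> str:
    """Toy router: specific-looking queries (numbers, technical
    keywords) get paragraph chunks; 'what is/does'-style questions
    get summaries; everything else gets section-level chunks."""
    q = query.lower()
    if any(ch.isdigit() for ch in q) or \
       any(w in q for w in ("limit", "error", "parameter", "api")):
        return "paragraph"
    if q.startswith(("what is", "what does", "overview", "summarize")):
        return "summary"
    return "section"
```

A real deployment would replace this with an intent classifier, but the interface — query in, retrieval level out — stays the same.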

What Size Actually Works?

The eternal question: how big should chunks be?

Here's what we've found works across different content types:

Technical Documentation: 500-800 characters

Dense information, lots of references between concepts. Smaller chunks keep each piece focused. We tested this extensively with API docs—retrieval precision was 22% better than 1,200-character chunks.

Business Reports: 800-1,200 characters

Balance between context and precision. Large enough to include supporting details. We settled on 1,000 characters for most business content after testing showed it hit the sweet spot.

Legal Documents: 1,200-1,500 characters

Complex language needs more context to be understood. Longer chunks prevent misinterpretation. We tried 800-character chunks for contracts and saw a 30% increase in "I need more context" feedback.

Marketing Content: 300-600 characters

Short, punchy paragraphs. Smaller chunks match the content structure. One client's blog archive retrieves best at 400 characters per chunk.

These are starting points, not rules. Test with your specific documents and adjust. We typically run a 2-week A/B test with different chunk sizes before committing to production.

Measuring Chunking Quality

How do you know if your chunking strategy works? Test it:

  1. Retrieval Accuracy: Do relevant chunks surface for test queries?
  2. Context Completeness: Can questions be answered from the chunks alone?
  3. Coherence: Do chunks make sense in isolation?
  4. Coverage: Does each chunk contain enough unique information?

Create a test set of queries with known good answers. Measure how often the right chunks are retrieved and rank highly.

We maintain a "golden set" of 200 test queries for each system we build. Any chunking change must maintain or improve performance on this set before going to production.
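A golden-set check can be as simple as recall at k: for each test query, do all the expected chunk IDs appear in the top-k results? This sketch assumes you supply your own `retrieve` function; the names are ours:

```python
def recall_at_k(golden: dict[str, set[str]], retrieve, k: int = 5) -> float:
    """Fraction of golden queries whose expected chunk IDs all appear
    in the top-k retrieved results. `retrieve(query)` must return an
    ordered list of chunk IDs."""
    hits = 0
    for query, expected in golden.items():
        top = set(retrieve(query)[:k])
        if expected <= top:
            hits += 1
    return hits / len(golden)
```

Run this before and after any chunking change; a drop on the golden set blocks the change from reaching production.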

Common Mistakes

Chunks Too Small

You lose context. The AI can't understand what pronouns refer to or why something matters. Answers become superficial.

We once chunked a technical manual at 200 characters per chunk. The system couldn't answer basic questions because critical context was always in a different chunk. Increasing to 600 characters fixed most issues.

Chunks Too Large

You retrieve irrelevant information alongside what's needed. The AI gets distracted by tangential details. Answers become unfocused.

At 2,000 characters per chunk, one system started including tangential information in 45% of answers. Users complained about verbosity. Dropping to 1,000 characters cut that to 12%.

Ignoring Document Structure

Tables, lists, code blocks—these need special handling. Generic chunking strategies break formatted content.

We built a system that split code examples across chunks. Developers hated it. We added logic to keep code blocks intact, even if it meant variable chunk sizes. Problem solved.
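Keeping code blocks intact comes down to splitting around fenced regions rather than through them. A minimal sketch for markdown-style fences — real documents need more robust parsing (indented code, nested fences), so treat this as the idea only:

```python
import re

def split_preserving_code(text: str) -> list[str]:
    """Split markdown-ish text so fenced code blocks are never broken
    apart; prose between fences is split on blank lines."""
    segments = []
    for part in re.split(r"(```.*?```)", text, flags=re.DOTALL):
        if part.startswith("```"):
            segments.append(part)  # keep the whole code block as one segment
        else:
            segments.extend(p for p in part.split("\n\n") if p.strip())
    return segments
```

The capturing group in `re.split` keeps the matched fences in the output, which is what lets us treat them as indivisible segments.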

No Overlap

Information at boundaries gets lost. Key facts get split awkwardly. Retrieval suffers.

Adding 20% overlap to one system reduced "incomplete answer" complaints by 40%. It's one of the highest ROI improvements we've found.

Forgetting Metadata

Without metadata, you can't filter by source, date, or type. You can't provide good citations. Users can't assess credibility.

Advanced Techniques

Question-Aware Chunking

Some systems maintain multiple chunk sets optimized for different query types. Factual questions use small, precise chunks. Conceptual questions use larger, context-rich chunks.

We implemented this for one client. It increased system complexity significantly but improved user satisfaction from 3.6 to 4.3. Worth it for high-value applications.

Dynamic Chunking

Adjust chunk size based on information density. Dense sections get smaller chunks; sparse sections get larger ones.

Cross-References

Link related chunks explicitly. When retrieving chunk A, also consider chunks that reference similar concepts.

We added cross-referencing to a knowledge base with heavily interconnected topics. Retrieval accuracy improved 18% on complex queries that required understanding relationships between concepts.

The Testing Loop

Chunking isn't set-it-and-forget-it. As your document collection grows, revisit your strategy:

  1. Monitor retrieval quality metrics
  2. Review failed queries—what went wrong?
  3. Test alternative chunking approaches on problem documents
  4. Deploy improvements incrementally
  5. Measure impact on user satisfaction

Good chunking improves over time as you learn from real usage patterns.

We review chunking performance quarterly for most systems. Each review typically yields 2-3 improvements that collectively improve metrics 5-10%.

Practical Recommendations

Start here:

  1. Begin with paragraph-based chunking at 800-1,000 characters
  2. Add 100-200 character overlap between chunks
  3. Include rich metadata with every chunk
  4. Test on real queries from your users
  5. Iterate based on metrics and feedback

As you scale:

  1. Customize by document type if you have clear categories
  2. Consider hierarchical approaches for complex documents
  3. Invest in semantic chunking where quality justifies the cost

We typically start with the basic approach and iterate. Most systems improve 20-30% on retrieval metrics through iterative refinement over 2-3 months.

The Real Goal

Perfect chunking doesn't exist. Your documents are too varied, your queries too diverse, your users' needs too specific.

The goal is chunks that are good enough for most queries, while failing gracefully on edge cases. When retrieval fails, make it obvious so users can try different queries.

Focus on the common cases. Make those work really well. For everything else, provide escape hatches—search by document title, browse by date, filter by type.

Our best systems handle 85-90% of queries well on the first try. The remaining 10-15% need refinement or manual help. That's realistic and users accept it.

Why This Matters

Chunking is invisible when it works well and frustrating when it doesn't. Users don't think about chunks—they just want good answers.

Your job is to make chunking good enough that nobody has to think about it. That's the art: the careful choices that make the whole system feel natural and effortless.

Get chunking right, and everything else gets easier.