Saturday, October 5, 2024

The Art of Document Chunking for LLM Applications

Abdellatif Abdelfattah

When building document-aware AI applications, one of the most consequential yet under-appreciated technical decisions is how to divide documents into smaller pieces or "chunks." This process, known as chunking, can dramatically impact retrieval quality, context relevance, and ultimately the performance of your AI system.

Why Chunking Matters

Large language models (LLMs) have context length limitations. Even with models that support tens of thousands of tokens, we often need to select the most relevant portions of documents to include in a prompt. Effective chunking helps ensure that:

  1. Retrieved information is coherent: Each chunk contains complete thoughts or ideas
  2. Relevant content stays together: Related information isn't split across different chunks
  3. Context isn't lost: Important surrounding context is preserved
  4. Retrieval is precise: Chunks are specific enough to target exact information needs

Different Chunking Strategies

There are several approaches to chunking, each with trade-offs in complexity, information preservation, and retrieval effectiveness:

1. Fixed-Size Chunking

The simplest approach: divide text into chunks of a fixed number of tokens or characters.

def fixed_size_chunking(text, chunk_size=1000, overlap=0):
    """
    Split text into chunks of fixed size with optional overlap
    """
    tokens = text.split()  # Simple tokenization by whitespace
    chunks = []
    
    for i in range(0, len(tokens), chunk_size - overlap):
        chunk = ' '.join(tokens[i:i + chunk_size])
        chunks.append(chunk)
    
    return chunks

Pros: Simple to implement; predictable chunk sizes.
Cons: May cut across natural document boundaries, creating incoherent chunks.
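
A quick usage sketch on made-up sample text (the parameter values are illustrative) shows how the overlap repeats tokens across chunk boundaries:

sample = "The quick brown fox jumps over the lazy dog. " * 100
chunks = fixed_size_chunking(sample, chunk_size=50, overlap=10)

print(len(chunks))      # number of chunks produced
print(chunks[0][-60:])  # the last few tokens of the first chunk...
print(chunks[1][:60])   # ...reappear at the start of the second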

2. Sentence-Based Chunking

A more context-aware approach that respects sentence boundaries:

import re

def sentence_based_chunking(text, max_sentences=10, overlap=2):
    """
    Split text into chunks of a maximum number of sentences with overlap
    """
    # Simple sentence splitting (for demonstration purposes)
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks = []
    
    for i in range(0, len(sentences), max_sentences - overlap):
        chunk = ' '.join(sentences[i:i + max_sentences])
        chunks.append(chunk)
    
    return chunks

Pros: Preserves sentence integrity.
Cons: Sentences can vary greatly in length, resulting in uneven chunks.
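
The simple regex above will also mis-split on abbreviations like "Dr." or "e.g.". If you need something more robust, a trained sentence tokenizer such as NLTK's sent_tokenize is an easy drop-in; here is a sketch, assuming nltk is installed and its punkt model has been downloaded:

import nltk

nltk.download("punkt", quiet=True)  # one-time download of the sentence model

def sentence_based_chunking_nltk(text, max_sentences=10, overlap=2):
    """Same chunking loop as above, with a more robust sentence splitter"""
    sentences = nltk.sent_tokenize(text)
    chunks = []
    
    for i in range(0, len(sentences), max_sentences - overlap):
        chunks.append(' '.join(sentences[i:i + max_sentences]))
    
    return chunks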

3. Paragraph-Based Chunking

Preserves paragraph structure, which often encapsulates complete ideas:

def paragraph_based_chunking(text, max_paragraphs=3, overlap=1):
    """
    Split text into chunks of a maximum number of paragraphs with overlap
    """
    paragraphs = text.split('\n\n')  # Simple paragraph splitting
    chunks = []
    
    for i in range(0, len(paragraphs), max_paragraphs - overlap):
        chunk = '\n\n'.join(paragraphs[i:i + max_paragraphs])
        chunks.append(chunk)
    
    return chunks

Pros: Preserves paragraph integrity, which often maps to coherent ideas.
Cons: Paragraphs can vary dramatically in size.

4. Semantic Chunking

The most sophisticated approach, which splits documents based on semantic coherence:

from sklearn.metrics.pairwise import cosine_similarity
from openai import OpenAI
import os

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

def get_embedding(text):
    """Get embeddings for text using OpenAI's API"""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

def semantic_chunking(text, max_chunk_size=1000, similarity_threshold=0.7):
    """
    Split text into semantically coherent chunks
    """
    # First split into smaller units (paragraphs here), dropping empty strings
    paragraphs = [p for p in text.split('\n\n') if p.strip()]
    chunks = []
    current_chunk = paragraphs[0]
    current_embedding = get_embedding(current_chunk)
    
    for i in range(1, len(paragraphs)):
        # Check if adding next paragraph would exceed max size
        if len(current_chunk) + len(paragraphs[i]) > max_chunk_size:
            chunks.append(current_chunk)
            current_chunk = paragraphs[i]
            current_embedding = get_embedding(current_chunk)
            continue
        
        # Check semantic similarity
        next_embedding = get_embedding(paragraphs[i])
        similarity = cosine_similarity([current_embedding], [next_embedding])[0][0]
        
        if similarity >= similarity_threshold:
            # Semantically similar, add to current chunk
            current_chunk += "\n\n" + paragraphs[i]
            # Update embedding for the growing chunk
            current_embedding = get_embedding(current_chunk)
        else:
            # Not similar enough, start a new chunk
            chunks.append(current_chunk)
            current_chunk = paragraphs[i]
            current_embedding = next_embedding
    
    # Add the last chunk
    if current_chunk:
        chunks.append(current_chunk)
    
    return chunks

Pros: Creates chunks with semantically related content.
Cons: Computationally expensive; requires repeated embedding generation.
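
One practical way to reduce that cost is to embed every paragraph up front in a single batched API call and compare adjacent paragraphs, rather than re-embedding the growing chunk at each step. A sketch, assuming the same client as above (note that adjacent-paragraph similarity is an approximation of the chunk-level similarity used earlier):

def embed_batch(texts):
    """Embed a list of texts in one API call (the embeddings endpoint accepts a list)"""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    return [item.embedding for item in response.data]

def semantic_chunking_batched(text, max_chunk_size=1000, similarity_threshold=0.7):
    """
    Semantic chunking with one embedding call for the whole document
    """
    paragraphs = [p for p in text.split('\n\n') if p.strip()]
    embeddings = embed_batch(paragraphs)  # one call instead of one per paragraph
    
    chunks = []
    current_chunk = paragraphs[0]
    
    for i in range(1, len(paragraphs)):
        similarity = cosine_similarity([embeddings[i - 1]], [embeddings[i]])[0][0]
        if (similarity >= similarity_threshold
                and len(current_chunk) + len(paragraphs[i]) <= max_chunk_size):
            current_chunk += "\n\n" + paragraphs[i]
        else:
            chunks.append(current_chunk)
            current_chunk = paragraphs[i]
    
    chunks.append(current_chunk)
    return chunks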

Hybrid Approaches

In practice, the best chunking strategies often combine multiple approaches:

import re

def hybrid_chunking(text, max_size=1000, overlap=100):
    """
    A hybrid approach that respects both semantic units and size constraints
    """
    # First split by paragraphs
    paragraphs = text.split('\n\n')
    chunks = []
    current_chunk = ""
    
    for paragraph in paragraphs:
        # If adding this paragraph would exceed max size and we already have content,
        # finalize the current chunk
        if current_chunk and len(current_chunk) + len(paragraph) > max_size:
            chunks.append(current_chunk)
            
            # Start new chunk with overlap from the end of previous chunk
            if overlap > 0 and len(current_chunk) > overlap:
                # Try to find a sentence boundary for clean overlap
                overlap_text = current_chunk[-overlap:]
                last_sentence_end = max(
                    overlap_text.rfind('. '),
                    overlap_text.rfind('! '),
                    overlap_text.rfind('? ')
                )
                
                if last_sentence_end != -1:
                    # Found a sentence boundary; keep only the complete sentences after it
                    current_chunk = overlap_text[last_sentence_end + 2:] + "\n\n"
                else:
                    # No clean sentence boundary, use the raw overlap
                    current_chunk = overlap_text + "\n\n"
            else:
                current_chunk = ""
        
        # Add paragraph to current chunk
        current_chunk += paragraph + "\n\n"
        
        # If this single paragraph exceeds max size, we need to split it by sentences
        if len(current_chunk) > max_size:
            sentences = re.split(r'(?<=[.!?])\s+', current_chunk)
            temp_chunk = ""
            
            for sentence in sentences:
                if len(temp_chunk) + len(sentence) > max_size and temp_chunk:
                    chunks.append(temp_chunk)
                    temp_chunk = sentence + " "
                else:
                    temp_chunk += sentence + " "
            
            current_chunk = temp_chunk
    
    # Add the final chunk if it has content
    if current_chunk:
        chunks.append(current_chunk)
    
    return chunks

This hybrid approach tries to respect paragraph boundaries while ensuring chunks don't exceed a maximum size, with intelligent overlap handling.
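
A quick sanity check on synthetic text confirms the size constraint holds even when a single paragraph is oversized:

long_paragraph = "This is a sentence. " * 120  # one ~2,400-character paragraph
doc = "Short intro.\n\n" + long_paragraph + "\n\nShort closing."

chunks = hybrid_chunking(doc, max_size=1000, overlap=100)
print(len(chunks), max(len(c) for c in chunks))  # every chunk stays at or near max_size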

Metadata-Enhanced Chunking

For structured documents like research papers, legal contracts, or technical documentation, preserving metadata during chunking can significantly improve retrieval:

def metadata_enhanced_chunking(document):
    """
    Split document while preserving metadata about its structure
    """
    chunks = []
    
    for section in document.sections:
        section_chunks = []
        paragraphs = section.content.split('\n\n')
        
        # Chunk the section content
        current_chunk = ""
        for paragraph in paragraphs:
            if len(current_chunk) + len(paragraph) > 1000 and current_chunk:
                section_chunks.append(current_chunk)
                current_chunk = paragraph + "\n\n"
            else:
                current_chunk += paragraph + "\n\n"
        
        if current_chunk:
            section_chunks.append(current_chunk)
        
        # Add metadata to each chunk
        for i, chunk in enumerate(section_chunks):
            chunks.append({
                "content": chunk,
                "metadata": {
                    "document_id": document.id,
                    "document_title": document.title,
                    "section_title": section.title,
                    "section_number": section.number,
                    "chunk_index": i,
                    "total_chunks_in_section": len(section_chunks)
                }
            })
    
    return chunks

This approach attaches metadata about document structure, making it possible to retrieve specific sections or understand the context of a chunk within the broader document.
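
The document object above is assumed rather than defined. A pair of hypothetical dataclasses is enough to exercise the function:

from dataclasses import dataclass, field

@dataclass
class Section:
    title: str
    number: int
    content: str

@dataclass
class Document:
    id: str
    title: str
    sections: list = field(default_factory=list)

doc = Document(
    id="doc-001",
    title="Example Contract",
    sections=[Section(title="Definitions", number=1,
                      content="Term one...\n\nTerm two...")]
)

chunks = metadata_enhanced_chunking(doc)
print(chunks[0]["metadata"]["section_title"])  # "Definitions"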

Evaluating Chunking Quality

How do you know if your chunking strategy is effective? Consider these evaluation metrics:

  1. Retrieval accuracy: Do queries return the most relevant chunks?
  2. Context completeness: Do chunks contain all information needed to answer a question?
  3. Coherence: Are chunks self-contained and understandable?
  4. Information density: Do chunks contain a high signal-to-noise ratio?

A simple evaluation function might look like:

def evaluate_chunking(chunks, test_queries, ground_truth):
    """
    Evaluate chunking quality using test queries and known relevant content
    """
    results = {}
    
    # Embed each chunk once up front, rather than re-embedding it for every query
    chunk_embeddings = [get_embedding(chunk) for chunk in chunks]
    
    for query, relevant_content in zip(test_queries, ground_truth):
        # Embed query
        query_embedding = get_embedding(query)
        
        # Score every chunk against the query
        similarities = []
        for i, chunk_embedding in enumerate(chunk_embeddings):
            similarity = cosine_similarity([query_embedding], [chunk_embedding])[0][0]
            similarities.append((i, similarity))
        
        # Sort by similarity descending
        similarities.sort(key=lambda x: x[1], reverse=True)
        top_chunk_idx = similarities[0][0]
        top_chunk = chunks[top_chunk_idx]
        
        # Check if relevant content is in the top chunk
        relevant_found = relevant_content in top_chunk
        
        results[query] = {
            "relevant_found": relevant_found,
            "top_chunk_idx": top_chunk_idx,
            "similarity_score": similarities[0][1]
        }
    
    # Calculate overall metrics
    found_rate = sum(1 for r in results.values() if r["relevant_found"]) / len(results)
    avg_similarity = sum(r["similarity_score"] for r in results.values()) / len(results)
    
    return {
        "found_rate": found_rate,
        "avg_similarity": avg_similarity,
        "detailed_results": results
    }
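
Usage is straightforward, though the query and expected snippet below are purely illustrative, and `chunks` is assumed to be a list of plain-text strings produced by one of the strategies above:

test_queries = ["What is the termination notice period?"]
ground_truth = ["thirty (30) days' written notice"]  # snippet expected in the top chunk

report = evaluate_chunking(chunks, test_queries, ground_truth)
print(f"Found rate: {report['found_rate']:.0%}")
print(f"Average top-chunk similarity: {report['avg_similarity']:.3f}")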

Best Practices

Based on extensive experimentation, here are some generally effective chunking practices:

  1. Respect document structure whenever possible (sections, paragraphs)
  2. Use overlapping chunks to prevent information loss at boundaries
  3. Include document metadata in each chunk for context
  4. Adapt chunk size to your specific document types
  5. Consider hierarchical chunking for navigating different levels of detail (see the sketch after this list)
  6. Test multiple strategies against your specific use case
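
On point 5, one common pattern indexes small "child" chunks for precise retrieval while keeping a pointer to the larger "parent" chunk they came from. A minimal sketch of the idea, reusing fixed_size_chunking from earlier (the record format is an assumption, not a standard API):

def hierarchical_chunking(text, parent_size=2000, child_size=400):
    """Return small child chunks, each linked to its larger parent chunk"""
    parents = fixed_size_chunking(text, chunk_size=parent_size)
    records = []
    
    for parent_id, parent in enumerate(parents):
        for child in fixed_size_chunking(parent, chunk_size=child_size):
            records.append({
                "content": child,        # embed and search over this small chunk
                "parent_id": parent_id,  # look up parents[parent_id] for full context
            })
    
    return records

At query time you search over the child chunks but hand the matching parent chunk to the model, combining retrieval precision with generation context.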

Conclusion

Document chunking is both art and science. The optimal approach depends on your document types, retrieval goals, and computational constraints. By understanding the trade-offs between different chunking strategies and implementing a thoughtful approach, you can significantly improve the performance of your document-aware AI applications.

Remember that chunking is just one part of a broader retrieval system. For best results, pair effective chunking with high-quality embeddings, intelligent retrieval techniques, and robust result reranking.