
# Document Chunking/Splitting in Langroid

Langroid's [ParsingConfig][langroid.parsing.parser.ParsingConfig] provides several document chunking strategies through the `Splitter` enum:

## 1. MARKDOWN (`Splitter.MARKDOWN`) — the default

Purpose: Structure-aware splitting that preserves markdown formatting.

How it works:

- Preserves document hierarchy (headers and sections)
- Enriches chunks with header information
- Uses word count rather than token count (with an adjustment factor)
- Supports "rollup" to maintain document structure
- Ideal for markdown documents where preserving formatting is important
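The header-enrichment idea can be sketched as follows. This is a minimal illustrative sketch, not Langroid's actual implementation: it cuts a markdown string at headers and prefixes each chunk with its header "trail" so downstream retrieval sees the chunk's context.

```python
# Illustrative sketch only (NOT Langroid's code): structure-aware splitting
# that cuts at markdown headers and enriches each chunk with its header path.

def markdown_chunks(text: str) -> list[str]:
    """Split markdown at headers, carrying the header trail into each chunk."""
    chunks, trail, body = [], [], []

    def flush():
        if body:
            header = " > ".join(trail)
            chunks.append((header + "\n" if header else "") + "\n".join(body).strip())
            body.clear()

    for line in text.splitlines():
        if line.startswith("#"):
            flush()
            level = len(line) - len(line.lstrip("#"))
            del trail[level - 1:]                  # drop headers at same/deeper level
            trail.append(line.lstrip("# ").strip())
        else:
            body.append(line)
    flush()
    return chunks

doc = "# Guide\nIntro text.\n## Setup\nInstall it.\n## Usage\nRun it."
for chunk in markdown_chunks(doc):
    print(repr(chunk))
```

Each chunk carries its header path (e.g. `Guide > Setup`), which is the kind of context enrichment that makes structure-aware splitting useful for retrieval.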

## 2. TOKENS (`Splitter.TOKENS`)

Purpose: Creates chunks of approximately equal token size.

How it works:

- Tokenizes the text using tiktoken
- Aims for chunks of roughly `chunk_size` tokens (default: 200)
- Looks for natural breakpoints such as punctuation or newlines
- Prefers splitting at sentence/paragraph boundaries
- Ensures chunks are at least `min_chunk_chars` long (default: 350)
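The breakpoint-seeking behavior can be sketched like this. This is an illustrative approximation, not Langroid's implementation: it uses words as a stand-in for tokens (Langroid uses tiktoken) and scans backwards from the size limit for sentence-ending punctuation.

```python
# Illustrative sketch (NOT Langroid's code): fixed-size chunking that prefers
# to break at sentence-ending punctuation rather than mid-sentence.

def token_chunks(text: str, chunk_size: int = 8) -> list[str]:
    words = text.split()  # stand-in for real tokenization
    chunks, start = [], 0
    while start < len(words):
        end = min(start + chunk_size, len(words))
        # scan backwards for a natural breakpoint (a word ending a sentence)
        for i in range(end - 1, start, -1):
            if words[i].endswith((".", "!", "?", ";")):
                end = i + 1
                break
        chunks.append(" ".join(words[start:end]))
        start = end
    return chunks

text = "One two three. Four five six seven eight nine ten. Eleven twelve."
print(token_chunks(text))
```

Note how each chunk ends at a sentence boundary even though the raw size limit would land mid-sentence.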

## 3. PARA_SENTENCE (`Splitter.PARA_SENTENCE`)

Purpose: Splits documents respecting paragraph and sentence boundaries.

How it works:

- Recursively splits documents until chunks are below 1.3× the target size
- Maintains document structure by preserving natural paragraph breaks
- Adjusts chunk boundaries to avoid cutting in the middle of sentences
- Stops when chunks can't be split further without breaking coherence
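The recursive halving idea can be sketched as follows. Again, this is an illustrative sketch under simplifying assumptions (size measured in words, a fixed 1.3× threshold), not Langroid's actual code: each oversized chunk is cut near its middle, preferring a paragraph break over a sentence break.

```python
# Illustrative sketch (NOT Langroid's code): recursively halve oversized text
# at the paragraph break nearest the middle, falling back to sentence breaks,
# until every chunk is below 1.3x the target size.

def para_sentence_chunks(text: str, target: int = 10) -> list[str]:
    if len(text.split()) <= 1.3 * target:
        return [text.strip()]
    for sep in ("\n\n", ". "):  # paragraphs first, then sentences
        mid = len(text) // 2
        left = text.rfind(sep, 0, mid)   # nearest separator before the middle
        right = text.find(sep, mid)      # nearest separator after the middle
        cut = max(left, right) if left == -1 or right == -1 else \
              (left if mid - left <= right - mid else right)
        if cut > 0:
            head, tail = text[:cut + len(sep)], text[cut + len(sep):]
            return para_sentence_chunks(head, target) + para_sentence_chunks(tail, target)
    return [text.strip()]  # cannot split further without breaking coherence

print(para_sentence_chunks("Aa bb cc. Dd ee ff. Gg hh ii.", target=3))
# → ['Aa bb cc.', 'Dd ee ff.', 'Gg hh ii.']
```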

## 4. SIMPLE (`Splitter.SIMPLE`)

Purpose: Basic splitting using predefined separators.

How it works:

- Uses a list of separators to split text (default: `["\n\n", "\n", " ", ""]`)
- Splits on the first separator in the list
- Doesn't attempt to balance chunk sizes
- Simplest and fastest splitting method
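A sketch of this separator-priority behavior, assuming "first separator" means the first one in the list that actually occurs in the text (an assumption, not a guarantee about Langroid's internals):

```python
# Illustrative sketch (NOT Langroid's code): split on the first separator in
# the priority list that occurs in the text; "" falls back to per-character.

DEFAULT_SEPARATORS = ["\n\n", "\n", " ", ""]

def simple_split(text: str, separators=DEFAULT_SEPARATORS) -> list[str]:
    for sep in separators:
        if sep and sep in text:
            return [piece for piece in text.split(sep) if piece.strip()]
    return list(text)  # "" separator: degrade to individual characters

print(simple_split("para one\n\npara two\nline"))  # splits on "\n\n" first
```

Because `"\n\n"` appears in the input, the single newline inside the second piece is left untouched.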

## Basic Configuration

```python
from langroid.parsing.parser import ParsingConfig, Splitter

config = ParsingConfig(
    splitter=Splitter.MARKDOWN,  # Most feature-rich option
    chunk_size=200,              # Target tokens per chunk
    chunk_size_variation=0.30,   # Allowed variation from target
    overlap=50,                  # Token overlap between chunks
    token_encoding_model="text-embedding-3-small"
)
```
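To get an intuition for how `chunk_size` and `overlap` interact, here is a back-of-envelope estimate. It assumes the common sliding-window semantics (each new chunk advances by `chunk_size - overlap` tokens); that is an assumption about behavior, not a Langroid API.

```python
import math

# Assumption (not a Langroid API): with overlapping chunks, each chunk
# advances the window by (chunk_size - overlap) tokens, so a document of
# total_tokens yields roughly ceil((total_tokens - overlap) / stride) chunks.

def estimated_chunks(total_tokens: int, chunk_size: int = 200, overlap: int = 50) -> int:
    stride = chunk_size - overlap
    return max(1, math.ceil((total_tokens - overlap) / stride))

print(estimated_chunks(10_000))  # (10000 - 50) / 150 -> 67 chunks
```

Larger overlap improves context continuity between neighboring chunks but increases the chunk count (and thus embedding/storage cost) for the same document.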

## Format-Specific Configuration

```python
from langroid.parsing.parser import ParsingConfig, Splitter, PdfParsingConfig

# Customize PDF parsing
config = ParsingConfig(
    splitter=Splitter.PARA_SENTENCE,
    pdf=PdfParsingConfig(
        library="pymupdf4llm"  # Default PDF parser
    )
)
```

```python
from langroid.parsing.parser import ParsingConfig, PdfParsingConfig, GeminiConfig

# Use Gemini for PDF parsing
config = ParsingConfig(
    pdf=PdfParsingConfig(
        library="gemini",
        gemini_config=GeminiConfig(
            model_name="gemini-2.0-flash",
            requests_per_minute=5
        )
    )
)
```

## Setting Up Parsing Config in DocChatAgentConfig

You can configure document parsing when creating a `DocChatAgent` by customizing the `parsing` field of `DocChatAgentConfig`:

```python
from langroid.agent.special.doc_chat_agent import DocChatAgentConfig
from langroid.parsing.parser import ParsingConfig, Splitter, PdfParsingConfig

# Create a DocChatAgent with custom parsing configuration
agent_config = DocChatAgentConfig(
    parsing=ParsingConfig(
        # Choose the splitting strategy
        splitter=Splitter.MARKDOWN,  # Structure-aware splitting with header context

        # Configure chunk sizes
        chunk_size=800,              # Target tokens per chunk
        overlap=150,                 # Overlap between chunks

        # Configure chunk behavior
        max_chunks=5000,             # Maximum number of chunks to create
        min_chunk_chars=250,         # Minimum characters when truncating at punctuation
        discard_chunk_chars=10,      # Discard chunks smaller than this

        # Configure context window
        n_neighbor_ids=3,            # Store 3 chunk IDs on either side

        # Configure PDF parsing specifically
        pdf=PdfParsingConfig(
            library="pymupdf4llm",   # Choose PDF parsing library
        )
    )
)
```