Markitdown Document Parsers

Langroid integrates with Microsoft's Markitdown library to convert Microsoft Office documents to Markdown. Three specialized parsers are available: one each for docx, xlsx, and pptx files.

Prerequisites

To use these parsers, install Langroid with the required extras:

pip install "langroid[markitdown]"    # Just Markitdown parsers
# or
pip install "langroid[doc-parsers]"   # All document parsers
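A quick way to confirm the extras were installed is to check that the underlying packages are importable before creating any parser. The snippet below is a minimal sketch: the helper name `missing_modules` is hypothetical, and it assumes the Markitdown library is importable under the module name `markitdown`.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumption: Markitdown's import name is `markitdown`.
missing = missing_modules(["langroid", "markitdown"])
if missing:
    print(f"Not installed: {', '.join(missing)}. "
          'Try: pip install "langroid[markitdown]"')
```

The check only looks up module specs and does not import the packages, so it is cheap to run at startup.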

Available Parsers

Once you have created a parser for the appropriate document type, call `parser.get_doc()` to get the entire document, or `parser.get_doc_chunks()` to get its content automatically split into chunks.
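The two calls can be compared side by side with a small helper. This is a sketch under stated assumptions: `summarize_parsed_doc` is a hypothetical name, and the code assumes that `get_doc()` returns an object with a string `content` attribute and that `get_doc_chunks()` returns a list of such objects.

```python
from typing import Any, Dict


def summarize_parsed_doc(parser: Any) -> Dict[str, int]:
    """Report the size of the full document and of its chunks.

    `parser` is any object exposing get_doc() and get_doc_chunks(),
    such as one returned by DocumentParser.create(...).
    """
    doc = parser.get_doc()            # the whole document as a single object
    chunks = parser.get_doc_chunks()  # the same content, split into chunks
    return {
        # Assumption: each returned object carries its text in `.content`.
        "doc_chars": len(doc.content),
        "num_chunks": len(chunks),
        "longest_chunk_chars": max((len(c.content) for c in chunks), default=0),
    }
```

Passing in any of the parsers created in the sections that follow should give a rough sense of how the chunking settings in `ParsingConfig` affect the output.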

1. `MarkitdownDocxParser`

Converts Word documents (*.docx) to markdown, preserving structure, formatting, and tables.

See the tests for examples of how to use these parsers.

from langroid.parsing.document_parser import DocumentParser
from langroid.parsing.parser import DocxParsingConfig, ParsingConfig

parser = DocumentParser.create(
    "path/to/document.docx",
    ParsingConfig(
        docx=DocxParsingConfig(library="markitdown-docx"),
        # ... other parsing config options
    ),
)

2. `MarkitdownXLSXParser`

Converts Excel spreadsheets (*.xlsx, *.xls) to markdown tables, preserving data and sheet structure.

from langroid.parsing.document_parser import DocumentParser
from langroid.parsing.parser import ParsingConfig, MarkitdownXLSParsingConfig

parser = DocumentParser.create(
    "path/to/spreadsheet.xlsx",
    ParsingConfig(xls=MarkitdownXLSParsingConfig())
)

3. `MarkitdownPPTXParser`

Converts PowerPoint presentations (*.pptx) to markdown, preserving slide content and structure.

from langroid.parsing.document_parser import DocumentParser
from langroid.parsing.parser import ParsingConfig, MarkitdownPPTXParsingConfig

parser = DocumentParser.create(
    "path/to/presentation.pptx",
    ParsingConfig(pptx=MarkitdownPPTXParsingConfig())
)
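When a script handles a mix of Office files, the three setups above can be folded into one dispatch on the file extension. The following is a minimal sketch rather than part of Langroid: `config_field_for` and `create_markitdown_parser` are hypothetical helper names, and the suffix-to-field mapping simply mirrors the examples above (including the use of the `xls` field for `.xlsx` files).

```python
from pathlib import Path

# Maps a file suffix to the ParsingConfig field used in the examples above.
_SUFFIX_TO_FIELD = {
    ".docx": "docx",
    ".xlsx": "xls",
    ".xls": "xls",
    ".pptx": "pptx",
}


def config_field_for(path: str) -> str:
    """Return the ParsingConfig field name matching the file's extension."""
    suffix = Path(path).suffix.lower()
    try:
        return _SUFFIX_TO_FIELD[suffix]
    except KeyError:
        raise ValueError(f"No Markitdown parser for {suffix!r} files") from None


def create_markitdown_parser(path: str):
    """Create a Markitdown-backed parser for a docx, xlsx/xls, or pptx file."""
    # Imported lazily so config_field_for() works without Langroid installed.
    from langroid.parsing.document_parser import DocumentParser
    from langroid.parsing.parser import (
        DocxParsingConfig,
        MarkitdownPPTXParsingConfig,
        MarkitdownXLSParsingConfig,
        ParsingConfig,
    )

    sub_configs = {
        "docx": lambda: DocxParsingConfig(library="markitdown-docx"),
        "xls": MarkitdownXLSParsingConfig,
        "pptx": MarkitdownPPTXParsingConfig,
    }
    field = config_field_for(path)
    config = ParsingConfig(**{field: sub_configs[field]()})
    return DocumentParser.create(path, config)
```

For example, `create_markitdown_parser("path/to/presentation.pptx")` would build the same parser as the PowerPoint example above, while an unsupported extension such as `.pdf` raises a `ValueError` before Langroid is imported.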
