Markitdown Document Parsers
Langroid integrates with Microsoft's Markitdown library to convert
Microsoft Office documents to Markdown.
Three specialized parsers are available: one each for docx, xlsx, and pptx files.
Prerequisites
To use these parsers, install Langroid with the required extras:
pip install "langroid[markitdown]" # Just Markitdown parsers
# or
pip install "langroid[doc-parsers]" # All document parsers
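After installing, a quick check like the following can confirm that the Markitdown dependency is importable before you create a parser. This is a minimal sketch using only the standard library; the helper name `markitdown_available` is hypothetical and not part of Langroid, and it assumes the library's import name is `markitdown`.

```python
from importlib.util import find_spec


def markitdown_available() -> bool:
    """Return True if the `markitdown` package can be imported."""
    # find_spec returns None when a top-level module cannot be located.
    return find_spec("markitdown") is not None


if __name__ == "__main__":
    if markitdown_available():
        print("markitdown is installed")
    else:
        print('markitdown is missing; try: pip install "langroid[markitdown]"')
```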
Available Parsers
Once you have created a parser for the relevant document type, you
can retrieve the entire document with parser.get_doc(),
or automatically chunked content with parser.get_doc_chunks().
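As a rough illustration of the two access patterns, the helper below takes any parser created as in the sections that follow and reports the size of the full document and the number of chunks. The function name `summarize_parsed_document` is hypothetical, and the sketch assumes the returned document objects expose their text as a `.content` attribute.

```python
from typing import Any, Tuple


def summarize_parsed_document(parser: Any) -> Tuple[int, int]:
    """Return (characters in the full document, number of chunks)."""
    # get_doc() returns the whole document as a single object;
    # this sketch assumes its text is available as `.content`.
    doc = parser.get_doc()
    # get_doc_chunks() returns the same content split into a list of chunks.
    chunks = parser.get_doc_chunks()
    return len(doc.content), len(chunks)
```

Call it with a parser returned by DocumentParser.create(...). Use get_doc() when you need the whole document at once, and get_doc_chunks() when downstream steps expect smaller pieces.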
1. MarkitdownDocxParser
Converts Word documents (*.docx) to markdown, preserving structure,
formatting, and tables.
See the tests
for examples of how to use these parsers.
from langroid.parsing.document_parser import DocumentParser
from langroid.parsing.parser import DocxParsingConfig, ParsingConfig

parser = DocumentParser.create(
    "path/to/document.docx",
    ParsingConfig(
        docx=DocxParsingConfig(library="markitdown-docx"),
        # ... other parsing config options
    ),
)
2. MarkitdownXLSXParser
Converts Excel spreadsheets (*.xlsx, *.xls) to markdown tables, preserving data and sheet structure.
from langroid.parsing.document_parser import DocumentParser
from langroid.parsing.parser import ParsingConfig, MarkitdownXLSParsingConfig

parser = DocumentParser.create(
    "path/to/spreadsheet.xlsx",
    ParsingConfig(xls=MarkitdownXLSParsingConfig()),
)
3. MarkitdownPPTXParser
Converts PowerPoint presentations (*.pptx) to markdown, preserving slide content and structure.