Markitdown Document Parsers
Langroid integrates with Microsoft's Markitdown library to convert
Microsoft Office documents to markdown format.
Three specialized parsers are available, for docx, xlsx, and pptx files.
Prerequisites
To use these parsers, install Langroid with the required extras:
pip install "langroid[markitdown]" # Just Markitdown parsers
# or
pip install "langroid[doc-parsers]" # All document parsers
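Because the Markitdown parsers depend on an optional extra, it can help to confirm that the dependency is present before creating a parser. A minimal sketch, assuming the extra installs Microsoft's package under the import name markitdown; the helper name has_markitdown is hypothetical and not part of Langroid:

```python
import importlib.util


def has_markitdown() -> bool:
    """Return True if a top-level module named `markitdown` can be found.

    Assumption: the "markitdown" extra installs a module importable under
    this name. Only the module's presence is checked; it is not imported.
    """
    return importlib.util.find_spec("markitdown") is not None


if not has_markitdown():
    print('Markitdown not found; try: pip install "langroid[markitdown]"')
```

Checking with find_spec avoids importing the package just to test for it.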
Available Parsers
Once you set up a parser for the appropriate document type, you
can get the entire document with parser.get_doc(),
or get automatically chunked content with parser.get_doc_chunks().
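The difference between the two calls can be sketched as follows. This assumes that get_doc() returns a single document object and get_doc_chunks() returns a list of them, each exposing its text as a .content attribute; the helper name summarize_parse is hypothetical and not part of Langroid.

```python
from typing import Any, Dict


def summarize_parse(parser: Any, preview_chars: int = 60) -> Dict[str, Any]:
    """Compare whole-document and chunked output of a document parser.

    `parser` is any object offering get_doc() and get_doc_chunks(),
    such as one returned by DocumentParser.create(...).
    Assumption: returned documents expose their text as `.content`.
    """
    doc = parser.get_doc()            # the entire document as one object
    chunks = parser.get_doc_chunks()  # the same text, split into chunks
    return {
        "total_chars": len(doc.content),
        "num_chunks": len(chunks),
        # Short preview of each chunk's text, for quick inspection
        "previews": [chunk.content[:preview_chars] for chunk in chunks],
    }
```

Calling summarize_parse(parser) on any of the parsers described below gives a quick view of how the configured chunking splits a given file.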
1. MarkitdownDocxParser
Converts Word documents (*.docx) to markdown, preserving structure,
formatting, and tables.
See the tests for examples of how to use these parsers.
from langroid.parsing.document_parser import DocumentParser
from langroid.parsing.parser import DocxParsingConfig, ParsingConfig

parser = DocumentParser.create(
    "path/to/document.docx",
    ParsingConfig(
        docx=DocxParsingConfig(library="markitdown-docx"),
        # ... other parsing config options
    ),
)
2. MarkitdownXLSXParser
Converts Excel spreadsheets (*.xlsx/*.xls) to markdown tables, preserving data and sheet structure.
from langroid.parsing.document_parser import DocumentParser
from langroid.parsing.parser import ParsingConfig, MarkitdownXLSParsingConfig

parser = DocumentParser.create(
    "path/to/spreadsheet.xlsx",
    ParsingConfig(xls=MarkitdownXLSParsingConfig()),
)
3. MarkitdownPPTXParser
Converts PowerPoint presentations (*.pptx) to markdown, preserving slide content and structure.
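By analogy with the XLSX example, a PPTX parser would presumably be created as sketched below. The class name MarkitdownPPTXParsingConfig and the pptx field of ParsingConfig are assumptions inferred from the XLSX pattern, and make_pptx_parser is a hypothetical helper; check them against your installed Langroid version. The imports are deferred into the function because the Markitdown extra is optional.

```python
def make_pptx_parser(path: str):
    """Create a Markitdown-based parser for a PowerPoint file (sketch)."""
    # Deferred imports: these need langroid installed with the markitdown extra.
    from langroid.parsing.document_parser import DocumentParser

    # ASSUMPTION: the config class and the `pptx` field mirror the XLSX example.
    from langroid.parsing.parser import MarkitdownPPTXParsingConfig, ParsingConfig

    return DocumentParser.create(
        path,
        ParsingConfig(pptx=MarkitdownPPTXParsingConfig()),
    )
```

With those assumptions, make_pptx_parser("path/to/presentation.pptx") would return a parser whose get_doc() and get_doc_chunks() methods work as described under Available Parsers.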