document_parser
langroid/parsing/document_parser.py
DocumentParser(source, config)
Bases: Parser
Abstract base class for extracting text from special types of docs such as PDFs or Docx.
Attributes:
Name | Type | Description |
---|---|---|
source | str | The source, either a URL or a file path. |
doc_bytes | BytesIO | BytesIO object containing the doc data. |
Args:
source (str|bytes): The source, which could be a path, a URL or a bytes object.
Source code in langroid/parsing/document_parser.py
create(source, config, doc_type=None)
classmethod
Create a DocumentParser instance based on source type and config.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
source | str \| bytes | The source, could be a URL, file path, or bytes object. | required |
config | ParserConfig | The parser configuration. | required |
doc_type | str \| None | The type of document, if known. | None |
Returns:
Name | Type | Description |
---|---|---|
DocumentParser | 'DocumentParser' | An instance of a DocumentParser subclass. |
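The idea of choosing a parser "based on source type" can be illustrated with a small sketch. This is not langroid's actual dispatch logic: `guess_doc_type` is a hypothetical helper that only shows one plausible way to infer a document type from a file suffix or from the leading bytes of the data (PDF files begin with `%PDF`, and DOCX/XLSX/PPTX files are ZIP containers beginning with `PK\x03\x04`).

```python
from pathlib import Path
from typing import Union


def guess_doc_type(source: Union[str, bytes]) -> str:
    """Hypothetical helper (not part of langroid): guess a document type.

    For bytes, inspect the leading "magic" bytes; for a path or URL,
    use the file suffix after dropping any query string.
    """
    if isinstance(source, bytes):
        if source.startswith(b"%PDF"):
            return "pdf"
        if source.startswith(b"PK\x03\x04"):
            # DOCX, XLSX and PPTX are all ZIP containers; the leading
            # bytes alone cannot tell them apart.
            return "zip-based-office"
        return "unknown"
    suffix = Path(source.split("?")[0]).suffix.lower()
    return suffix.lstrip(".") or "unknown"
```

When the caller already knows the type, passing `doc_type` explicitly avoids this kind of guessing altogether.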
chunks_from_path_or_bytes(source, parser, doc_type=None, lines=None)
staticmethod
Get document chunks from a file path or bytes object.
Args:
source (str|bytes): The source, which could be a URL, path or bytes object.
parser (Parser): The parser instance (for splitting the document).
doc_type (str|DocumentType|None): The type of document, if known.
lines (int|None): The number of lines to read from a plain text file.
Returns:
List[Document]: A list of Document objects, each containing a chunk of text, determined by the chunking and splitting settings in the parser config.
iterate_pages()
get_document_from_page(page)
Get the Langroid Document object (with possible metadata) corresponding to a given page.
fix_text(text)
Fix text extracted from a PDF.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | The extracted text. | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | The fixed text. |
get_doc()
Get entire text from source as a single document.
Returns:
Type | Description |
---|---|
Document | A single Document containing the entire text of the source. |
get_doc_chunks()
Get document chunks from a PDF source, with page references in the document metadata.
Returns:
Type | Description |
---|---|
List[Document] | A list of Document objects, each containing a chunk of text. |
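The relationship between `iterate_pages()`, `get_document_from_page()`, `get_doc()` and `get_doc_chunks()` can be sketched with a toy base class. Everything here is an assumption for illustration: `Doc`, `PageParser` and `ListParser` are hypothetical stand-ins rather than langroid classes, pages are numbered from 0, and fixed-size character chunks stand in for the chunking settings of the real parser config.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, Generator, List, Tuple


@dataclass
class Doc:
    """Hypothetical stand-in for langroid's Document."""
    content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class PageParser(ABC):
    """Hypothetical stand-in for DocumentParser: subclasses supply the
    page iteration and the page-to-document conversion."""

    @abstractmethod
    def iterate_pages(self) -> Generator[Tuple[int, Any], None, None]:
        ...

    @abstractmethod
    def get_document_from_page(self, page: Any) -> Doc:
        ...

    def get_doc(self) -> Doc:
        # Concatenate the text of every page into a single document.
        parts = [self.get_document_from_page(p).content
                 for _, p in self.iterate_pages()]
        return Doc(content="\n".join(parts))

    def get_doc_chunks(self, chunk_chars: int = 20) -> List[Doc]:
        # Simplification: split each page into fixed-size character
        # chunks and record the page number in the chunk metadata.
        chunks: List[Doc] = []
        for page_no, page in self.iterate_pages():
            text = self.get_document_from_page(page).content
            for start in range(0, len(text), chunk_chars):
                chunks.append(
                    Doc(text[start:start + chunk_chars], {"page": page_no})
                )
        return chunks


class ListParser(PageParser):
    """Toy subclass whose 'pages' are plain strings."""

    def __init__(self, pages: List[str]) -> None:
        self.pages = pages

    def iterate_pages(self) -> Generator[Tuple[int, Any], None, None]:
        yield from enumerate(self.pages)

    def get_document_from_page(self, page: Any) -> Doc:
        return Doc(content=page)
```

The concrete parsers documented below follow the same division of labor: each one overrides the two page-level methods for its own library.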
FitzPDFParser(source, config)
Bases: DocumentParser
Parser for processing PDFs using the fitz library.
iterate_pages()
Yield each page in the PDF using fitz.
Returns:
Type | Description |
---|---|
Generator[Tuple[int, 'fitz.Page'], None, None] | Generator yielding an (int, fitz.Page) tuple for each page. |
get_document_from_page(page)
Get Document object from a given fitz page.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
page | Page | The fitz page. | required |
Returns:
Name | Type | Description |
---|---|---|
Document | Document | Document object, with content and possible metadata. |
PyMuPDF4LLMParser(source, config)
Bases: DocumentParser
Parser for processing PDFs using the pymupdf4llm library.
iterate_pages()
Yield each page in the PDF using fitz.
Returns:
Type | Description |
---|---|
Generator[Tuple[int, 'fitz.Page'], None, None] | Generator yielding each page. |
get_document_from_page(page)
Get Document object corresponding to a given "page-chunk" dictionary, see: https://pymupdf.readthedocs.io/en/latest/pymupdf4llm/api.html
Parameters:
Name | Type | Description | Default |
---|---|---|---|
page | Dict[str, Any] | The "page-chunk" dictionary. | required |
Returns:
Name | Type | Description |
---|---|---|
Document | Document | Document object, with content and possible metadata. |
DoclingParser(source, config)
Bases: DocumentParser
Parser for processing PDFs using the docling library.
iterate_pages()
Yield each page in the PDF using docling.
Code largely from this example:
https://github.com/DS4SD/docling/blob/4d41db3f7abb86c8c65386bf94e7eb0bf22bb82b/docs/examples/export_figures.py
Returns:
Type | Description |
---|---|
Generator[Tuple[int, Any], None, None] | Generator yielding each page. |
get_document_from_page(md_file)
Get Document object from a given 1-page markdown file, possibly containing image refs.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
md_file | str | The markdown file path for the page. | required |
Returns:
Name | Type | Description |
---|---|---|
Document | Document | Document object, with content and possible metadata. |
PyPDFParser(source, config)
Bases: DocumentParser
Parser for processing PDFs using the pypdf library.
iterate_pages()
Yield each page in the PDF using pypdf.
Returns:
Type | Description |
---|---|
Generator[Tuple[int, PageObject], None, None] | Generator yielding an (int, PageObject) tuple for each page. |
get_document_from_page(page)
Get Document object from a given pypdf page.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
page | PageObject | The pypdf page. | required |
Returns:
Name | Type | Description |
---|---|---|
Document | Document | Document object, with content and possible metadata. |
ImagePdfParser(source, config)
Bases: DocumentParser
Parser for processing PDFs that are images, i.e. not "true" PDFs.
get_document_from_page(page)
Get Document object corresponding to a given pdf2image page.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
page | Image | The PIL Image object. | required |
Returns:
Name | Type | Description |
---|---|---|
Document | Document | Document object, with content and possible metadata. |
UnstructuredPDFParser(source, config)
Bases: DocumentParser
Parser for processing PDF files using the unstructured library.
get_document_from_page(page)
Get Document object from a given unstructured element.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
page | unstructured element | The unstructured element. | required |
Returns:
Name | Type | Description |
---|---|---|
Document | Document | Document object, with content and possible metadata. |
UnstructuredDocxParser(source, config)
Bases: DocumentParser
Parser for processing DOCX files using the unstructured library.
get_document_from_page(page)
Get Document object from a given unstructured element.
Note
The concept of "pages" does not exist in the .docx file format in the same way it does in formats like .pdf. A .docx file is made up of a series of elements such as paragraphs and tables, and the division into pages is done dynamically based on the rendering settings (page size, margin size, font size, etc.).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
page | unstructured element | The unstructured element. | required |
Returns:
Type | Description |
---|---|
Document | Document object, with content and possible metadata. |
PythonDocxParser(source, config)
Bases: DocumentParser
Parser for processing DOCX files using the python-docx library.
iterate_pages()
Simulate iterating through pages. In a DOCX file, pages are not explicitly defined, so we consider each paragraph as a separate 'page' for simplicity.
get_document_from_page(page)
Get Document object from a given 'page', which in this case is a single paragraph.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
page | list | A list containing a single Paragraph object. | required |
Returns:
Name | Type | Description |
---|---|---|
Document | Document | Document object, with content and possible metadata. |
MarkitdownDocxParser(source, config)
Bases: DocumentParser
get_document_from_page(md_content)
Get Document object from a given markdown section.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
md_content | str | The markdown content for the section. | required |
Returns:
Name | Type | Description |
---|---|---|
Document | Document | Document object, with content and possible metadata. |
MarkitdownXLSXParser(source, config)
Bases: DocumentParser
get_document_from_page(md_content)
Get Document object from a given 1-page markdown string.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
md_content | str | The markdown content for the page. | required |
Returns:
Name | Type | Description |
---|---|---|
Document | Document | Document object, with content and possible metadata. |
MarkitdownPPTXParser(source, config)
Bases: DocumentParser
get_document_from_page(md_content)
Get Document object from a given 1-page markdown string.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
md_content | str | The markdown content for the page. | required |
Returns:
Name | Type | Description |
---|---|---|
Document | Document | Document object, with content and possible metadata. |
LLMPdfParser(source, config)
Bases: DocumentParser
This class converts PDFs to Markdown using multimodal LLMs.
It extracts pages, converts them with the LLM (replacing images with detailed descriptions), and outputs Markdown page by page. The conversion follows LLM_PDF_MD_SYSTEM_INSTRUCTION. It employs multiprocessing for speed, async requests with rate limiting, and handles errors.
It supports page-by-page splitting or chunking multiple pages into one, respecting page boundaries and a max_token_limit.
max_tokens = self.llm_parser_config.max_tokens or self.DEFAULT_MAX_TOKENS
instance-attribute
If True, each PDF page is processed as a separate chunk, resulting in one LLM request per page. If False, pages are grouped into chunks based on max_token_limit before being sent to the LLM.
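The two modes described above (one page per chunk, or pages grouped under a token limit without splitting any page) can be sketched as follows. This is an illustrative sketch, not the parser's code: `group_pages`, its `one_page_per_chunk` flag, and the per-page token counts are all hypothetical, and a greedy grouping strategy is assumed.

```python
from typing import List


def group_pages(
    page_tokens: List[int],
    max_tokens: int,
    one_page_per_chunk: bool = False,
) -> List[List[int]]:
    """Hypothetical helper: return groups of page indices.

    Pages are never split across groups. A page that alone exceeds
    max_tokens still gets a group of its own.
    """
    if one_page_per_chunk:
        return [[i] for i in range(len(page_tokens))]

    groups: List[List[int]] = []
    current: List[int] = []
    used = 0
    for i, n in enumerate(page_tokens):
        # Close the current group if adding this page would exceed the limit.
        if current and used + n > max_tokens:
            groups.append(current)
            current, used = [], 0
        current.append(i)
        used += n
    if current:
        groups.append(current)
    return groups
```

Grouping reduces the number of LLM requests at the cost of larger requests, which is the trade-off the boolean setting exposes.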
requests_per_minute = self.llm_parser_config.requests_per_minute or 5
instance-attribute
A semaphore to control the number of concurrent requests to the LLM, preventing rate limit errors. A semaphore slot is acquired before making an LLM request and released after the request is complete.
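The acquire-before-request, release-after-request pattern described above can be sketched with `asyncio.Semaphore`. The names `call_llm` and `process_all` are hypothetical, and `asyncio.sleep` stands in for a real LLM request; the `state` dictionary exists only so the sketch can report the peak number of requests in flight.

```python
import asyncio
from typing import Dict, List, Tuple


async def call_llm(chunk: str, sem: asyncio.Semaphore, state: Dict[str, int]) -> str:
    """Hypothetical request function guarded by a semaphore."""
    async with sem:  # slot acquired here, released when the block exits
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # stand-in for the actual LLM call
        state["active"] -= 1
        return chunk.upper()


async def process_all(chunks: List[str], max_concurrent: int = 2) -> Tuple[List[str], int]:
    """Send all chunks, allowing at most max_concurrent requests at once."""
    sem = asyncio.Semaphore(max_concurrent)
    state = {"active": 0, "peak": 0}
    results = await asyncio.gather(*(call_llm(c, sem, state) for c in chunks))
    return list(results), state["peak"]
```

Calling `asyncio.run(process_all(chunks, max_concurrent=2))` returns the results in input order, since `asyncio.gather` preserves the order of its arguments. Note that a semaphore bounds concurrency only; a strict requests-per-minute cap would need additional timing logic.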
process_chunks(chunks)
async
Processes PDF chunks by sending them to the LLM API and collecting the results.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
chunks | List[Dict[str, Any]] | A list of dictionaries, where each dictionary represents a PDF chunk and contains the PDF data and page numbers. | required |
iterate_pages()
Iterates over the document pages, extracts their content using the LLM API, saves it to a markdown file, and yields page numbers along with their corresponding content.
Yields:
Type | Description |
---|---|
Tuple[int, Any] | Tuples containing the page number (int) and the page content (Any). |
get_document_from_page(page)
Get a Document object from a given markdown page.
MarkerPdfParser(source, config)
Bases: DocumentParser
Parse PDF files using the marker library: https://github.com/VikParuchuri/marker
iterate_pages()
Yield each page in the PDF using marker.
get_document_from_page(page)
Get Document object from a given 1-page markdown file, possibly containing image refs.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
page | str | The page obtained by splitting the large markdown file produced by marker. | required |
Returns:
Name | Type | Description |
---|---|---|
Document | Document | Document object, with content and possible metadata. |
find_last_full_char(possible_unicode)
Find the index of the last full character in a byte string.
Args:
possible_unicode (bytes): The bytes to check.
Returns:
int: The index of the last full unicode character.
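One way to locate the last complete character in a possibly truncated UTF-8 byte string is to scan backwards for a lead byte whose full sequence fits in the buffer. The sketch below is an independent illustration, not langroid's implementation: `char_len` and `last_full_char_index` are hypothetical names, the return convention (index of the lead byte, or -1 if none) is an assumption, and continuation bytes are not validated.

```python
def char_len(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence started by this lead byte.

    Returns 0 for a continuation byte (10xxxxxx) or an invalid lead byte.
    """
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    return 0


def last_full_char_index(data: bytes) -> int:
    """Index of the lead byte of the last complete character, or -1."""
    for i in range(len(data) - 1, -1, -1):
        n = char_len(data[i])
        # A lead byte counts only if its whole sequence fits in the buffer.
        if n and i + n <= len(data):
            return i
    return -1
```

With the index `i` in hand, `data[: i + char_len(data[i])]` is the longest prefix that ends on a character boundary, which is useful when only the first N bytes of a file have been read.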
is_plain_text(path_or_bytes)
Check if a file is plain text by attempting to decode it as UTF-8.
Args:
path_or_bytes (str|bytes): The file path or bytes object.
Returns:
bool: True if the file is plain text, False otherwise.
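The decode-as-UTF-8 check can be sketched as below. This is a minimal illustration under stated assumptions, not langroid's code: `looks_like_plain_text` and its `sample_size` parameter are hypothetical, and an incremental decoder is used so that a multi-byte character cut off at the end of the sample does not cause a false negative.

```python
import codecs
from typing import Union


def looks_like_plain_text(path_or_bytes: Union[str, bytes], sample_size: int = 1024) -> bool:
    """Hypothetical helper: True if the first sample_size bytes decode as UTF-8."""
    if isinstance(path_or_bytes, bytes):
        sample = path_or_bytes[:sample_size]
    else:
        with open(path_or_bytes, "rb") as f:
            sample = f.read(sample_size)

    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates an incomplete multi-byte sequence at the
        # very end of the sample, but still rejects invalid bytes.
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True
```

A check of this kind is only a heuristic: some binary data happens to be valid UTF-8, so a True result does not guarantee human-readable text.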