document_parser
langroid/parsing/document_parser.py
DocumentParser(source, config)
Bases: Parser
Abstract base class for extracting text from special types of docs such as PDFs or Docx.
Attributes:
Name | Type | Description |
---|---|---|
source | str | The source, either a URL or a file path. |
doc_bytes | BytesIO | BytesIO object containing the doc data. |
The constructor's source argument may be a path, a URL or a bytes object.
create(source, config, doc_type=None)
classmethod
Create a DocumentParser instance based on source type and config.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
source | Union[str, bytes] | The source, could be a URL, file path, or bytes object. | required |
config | ParserConfig | The parser configuration. | required |
doc_type | Optional[str] | The type of document, if known. | None |
Returns:
Name | Type | Description |
---|---|---|
DocumentParser | DocumentParser | An instance of a DocumentParser subclass. |
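The dispatch that create performs can be pictured as a lookup from document type to parser subclass. The sketch below is a self-contained illustration of that idea only, not langroid's implementation: SketchParser, PdfSketchParser, DocxSketchParser, the registry attribute and the suffix-based type guess are all hypothetical names and behaviors introduced here.

```python
from __future__ import annotations


class SketchParser:
    """Hypothetical stand-in for a DocumentParser-like base class."""

    # Maps a doc-type string to a concrete subclass (assumption of this sketch).
    registry: dict[str, type[SketchParser]] = {}

    def __init__(self, source: str | bytes) -> None:
        self.source = source

    @classmethod
    def create(cls, source: str | bytes, doc_type: str | None = None) -> SketchParser:
        # If the type is not given, guess it from the file suffix (assumption).
        if doc_type is None:
            if isinstance(source, bytes):
                raise ValueError("doc_type is required for bytes input in this sketch")
            doc_type = source.rsplit(".", 1)[-1].lower()
        try:
            return cls.registry[doc_type](source)
        except KeyError:
            raise ValueError(f"Unsupported document type: {doc_type}") from None


class PdfSketchParser(SketchParser):
    pass


class DocxSketchParser(SketchParser):
    pass


SketchParser.registry = {"pdf": PdfSketchParser, "docx": DocxSketchParser}

parser = SketchParser.create("report.PDF")
print(type(parser).__name__)  # PdfSketchParser
```

Callers only ever name the base class; the concrete subclass is chosen from the source or from an explicit doc_type.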
chunks_from_path_or_bytes(source, parser, doc_type=None, lines=None)
staticmethod
Get document chunks from a file path or bytes object.
Args:
source (str|bytes): The source, which could be a URL, path or bytes object.
parser (Parser): The parser instance (for splitting the document).
doc_type (str|DocumentType|None): The type of document, if known.
lines (int|None): The number of lines to read from a plain text file.
Returns:
List[Document]: A list of Document objects, each containing a chunk of text, determined by the chunking and splitting settings in the parser config.
iterate_pages()
get_document_from_page(page)
Get the Langroid Document object (with possible metadata) corresponding to a given page.
fix_text(text)
Fix text extracted from a PDF.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | The extracted text. | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | The fixed text. |
get_doc()
Get entire text from source as a single document.
Returns:
Type | Description |
---|---|
Document | A Document object containing the entire text of the source. |
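The methods iterate_pages, get_document_from_page and get_doc fit together as a template: a subclass supplies the first two, and get_doc assembles the whole text from them. A minimal sketch of that pattern follows. Doc and ListParser are hypothetical stand-ins invented for this example, and joining page texts with an empty separator is an assumption, not langroid's documented behavior.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Generator, List, Tuple


@dataclass
class Doc:
    """Hypothetical stand-in for a Document: text plus metadata."""

    content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class ListParser:
    """Hypothetical parser whose 'pages' are strings held in a list."""

    def __init__(self, source: str, pages: List[str]) -> None:
        self.source = source
        self._pages = pages

    def iterate_pages(self) -> Generator[Tuple[int, str], None, None]:
        # Yield (index, page) pairs, one per page.
        yield from enumerate(self._pages)

    def get_document_from_page(self, page: str) -> Doc:
        # Convert one page into a document with some metadata.
        return Doc(content=page, metadata={"source": self.source})

    def get_doc(self) -> Doc:
        # Join the text of every page into a single document.
        text = "".join(
            self.get_document_from_page(page).content
            for _, page in self.iterate_pages()
        )
        return Doc(content=text, metadata={"source": self.source})


doc = ListParser("memo.txt", ["first page. ", "second page."]).get_doc()
print(doc.content)  # first page. second page.
```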
get_doc_chunks()
Get document chunks from a PDF source, with page references in the document metadata.
Adapted from https://github.com/whitead/paper-qa/blob/main/paperqa/readers.py
Returns:
Type | Description |
---|---|
List[Document] | A list of Document objects, each containing a chunk of text. |
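One way to picture chunking with page references is to accumulate page text in a buffer and emit a chunk whenever the buffer exceeds a size limit, recording which pages contributed. The sketch below does this with a character count, which is a simplification chosen for this example; the function name chunk_with_page_refs and the "first-last" label format are hypothetical, and the real method relies on the chunking settings in the parser config.

```python
from typing import List, Tuple


def chunk_with_page_refs(pages: List[str], chunk_size: int) -> List[Tuple[str, str]]:
    """Split page texts into chunks of at most chunk_size characters,
    pairing each chunk with the range of 1-based pages it came from."""
    chunks: List[Tuple[str, str]] = []
    buffer = ""
    buffer_pages: List[int] = []
    for page_no, text in enumerate(pages, start=1):
        buffer += text
        buffer_pages.append(page_no)
        # Emit full-size chunks while the buffer is over the limit.
        while len(buffer) > chunk_size:
            label = f"{buffer_pages[0]}-{buffer_pages[-1]}"
            chunks.append((buffer[:chunk_size], label))
            buffer = buffer[chunk_size:]
            # The leftover text comes from the most recent page only.
            buffer_pages = [page_no]
    if buffer:
        chunks.append((buffer, f"{buffer_pages[0]}-{buffer_pages[-1]}"))
    return chunks


result = chunk_with_page_refs(["aaaa", "bb", "cccccc"], chunk_size=5)
for text, pages_label in result:
    print(pages_label, text)
```

A chunk that straddles a page boundary is labeled with both pages, so the reference stays accurate even when the text does not align with page breaks.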
FitzPDFParser(source, config)
Bases: DocumentParser
Parser for processing PDFs using the fitz library.
iterate_pages()
Yield each page in the PDF using fitz.
Returns:
Type | Description |
---|---|
Generator[Tuple[int, 'fitz.Page'], None, None] | Generator yielding an (int, fitz.Page) tuple for each page. |
get_document_from_page(page)
Get Document object from a given fitz page.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
page | Page | The fitz page object. | required |
Returns:
Name | Type | Description |
---|---|---|
Document | Document | Document object, with content and possible metadata. |
PyMuPDF4LLMParser(source, config)
Bases: DocumentParser
Parser for processing PDFs using the pymupdf4llm library.
iterate_pages()
Yield each page in the PDF using fitz.
Returns:
Type | Description |
---|---|
Generator[Tuple[int, 'fitz.Page'], None, None] | Generator yielding an (int, page) tuple for each page. |
get_document_from_page(page)
Get Document object corresponding to a given "page-chunk" dictionary, see: https://pymupdf.readthedocs.io/en/latest/pymupdf4llm/api.html
Parameters:
Name | Type | Description | Default |
---|---|---|---|
page | Dict[str, Any] | The "page-chunk" dictionary. | required |
Returns:
Name | Type | Description |
---|---|---|
Document | Document | Document object, with content and possible metadata. |
DoclingParser(source, config)
Bases: DocumentParser
Parser for processing PDFs using the docling library.
iterate_pages()
Yield each page in the PDF using docling.
Code largely from this example:
https://github.com/DS4SD/docling/blob/4d41db3f7abb86c8c65386bf94e7eb0bf22bb82b/docs/examples/export_figures.py
Returns:
Type | Description |
---|---|
Generator[Tuple[int, Any], None, None] | Generator yielding an (int, page) tuple for each page. |
get_document_from_page(md_file)
Get Document object from a given 1-page markdown file, possibly containing image refs.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
md_file | str | The markdown file path for the page. | required |
Returns:
Name | Type | Description |
---|---|---|
Document | Document | Document object, with content and possible metadata. |
PyPDFParser(source, config)
Bases: DocumentParser
Parser for processing PDFs using the pypdf library.
iterate_pages()
Yield each page in the PDF using pypdf.
Returns:
Type | Description |
---|---|
Generator[Tuple[int, PageObject], None, None] | Generator yielding an (int, PageObject) tuple for each page. |
get_document_from_page(page)
Get Document object from a given pypdf page.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
page | PageObject | The pypdf page object. | required |
Returns:
Name | Type | Description |
---|---|---|
Document | Document | Document object, with content and possible metadata. |
ImagePdfParser(source, config)
Bases: DocumentParser
Parser for processing PDFs that are images, i.e. not "true" PDFs.
get_document_from_page(page)
Get Document object corresponding to a given pdf2image page.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
page | Image | The PIL Image object. | required |
Returns:
Name | Type | Description |
---|---|---|
Document | Document | Document object, with content and possible metadata. |
UnstructuredPDFParser(source, config)
Bases: DocumentParser
Parser for processing PDF files using the unstructured library.
get_document_from_page(page)
Get Document object from a given unstructured element.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
page | unstructured element | The unstructured element. | required |
Returns:
Name | Type | Description |
---|---|---|
Document | Document | Document object, with content and possible metadata. |
UnstructuredDocxParser(source, config)
Bases: DocumentParser
Parser for processing DOCX files using the unstructured library.
get_document_from_page(page)
Get Document object from a given unstructured element.
Note
The concept of "pages" doesn't actually exist in the .docx file format in the same way it does in formats like .pdf. A .docx file is made up of a series of elements like paragraphs and tables, but the division into pages is done dynamically based on the rendering settings (like the page size, margin size, font size, etc.).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
page | unstructured element | The unstructured element. | required |
Returns:
Type | Description |
---|---|
Document | Document object, with content and possible metadata. |
PythonDocxParser(source, config)
Bases: DocumentParser
Parser for processing DOCX files using the python-docx library.
iterate_pages()
Simulate iterating through pages. In a DOCX file, pages are not explicitly defined, so we consider each paragraph as a separate 'page' for simplicity.
get_document_from_page(page)
Get Document object from a given 'page', which in this case is a single paragraph.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
page | list | A list containing a single Paragraph object. | required |
Returns:
Name | Type | Description |
---|---|---|
Document | Document | Document object, with content and possible metadata. |
MarkitdownDocxParser(source, config)
Bases: DocumentParser
get_document_from_page(md_content)
Get Document object from a given markdown section.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
md_content | str | The markdown content for the section. | required |
Returns:
Name | Type | Description |
---|---|---|
Document | Document | Document object, with content and possible metadata. |
MarkitdownXLSXParser(source, config)
Bases: DocumentParser
get_document_from_page(md_content)
Get Document object from a given 1-page markdown string.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
md_content | str | The markdown content for the page. | required |
Returns:
Name | Type | Description |
---|---|---|
Document | Document | Document object, with content and possible metadata. |
MarkitdownPPTXParser(source, config)
Bases: DocumentParser
get_document_from_page(md_content)
Get Document object from a given 1-page markdown string.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
md_content | str | The markdown content for the page. | required |
Returns:
Name | Type | Description |
---|---|---|
Document | Document | Document object, with content and possible metadata. |
GeminiPdfParser(source, config)
Bases: DocumentParser
This class converts PDFs to Markdown using Gemini multimodal LLMs.
It extracts pages, converts them with the LLM (replacing images with detailed descriptions), and outputs Markdown page by page. The conversion follows GEMINI_SYSTEM_INSTRUCTION. It employs multiprocessing for speed and async requests with rate limiting, and it handles errors.
It supports page-by-page splitting or chunking multiple pages into one, respecting page boundaries and a max_token_limit.
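Chunking multiple pages into one while respecting page boundaries can be sketched as a greedy grouping: pages are added to the current group until the next page would push it past the token limit. The function group_pages and the idea of passing precomputed per-page token counts are assumptions of this example, not the class's actual interface.

```python
from typing import List


def group_pages(page_token_counts: List[int], max_tokens: int) -> List[List[int]]:
    """Group consecutive 1-based page numbers so that each group's total
    token count stays within max_tokens. Pages are never split, so a single
    page larger than the limit ends up in a group of its own."""
    groups: List[List[int]] = []
    current: List[int] = []
    current_tokens = 0
    for page_no, n_tokens in enumerate(page_token_counts, start=1):
        # Close the current group if this page would overflow it.
        if current and current_tokens + n_tokens > max_tokens:
            groups.append(current)
            current, current_tokens = [], 0
        current.append(page_no)
        current_tokens += n_tokens
    if current:
        groups.append(current)
    return groups


groups = group_pages([300, 400, 500, 1200, 100], max_tokens=1000)
print(groups)  # [[1, 2], [3], [4], [5]]
```

Because groups are closed only at page boundaries, an oversized page (page 4 here) exceeds the limit on its own rather than being cut in the middle.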
max_tokens = config.pdf.gemini_config.max_tokens or self.DEFAULT_MAX_TOKENS
instance-attribute
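Token limit for a chunk, taken from the config with DEFAULT_MAX_TOKENS as the fallback. It matters only when page-by-page splitting is off: if splitting is on, each PDF page is processed as a separate chunk, resulting in one LLM request per page; otherwise pages are grouped into chunks based on max_token_limit before being sent to the LLM.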
requests_per_minute = config.pdf.gemini_config.requests_per_minute or 5
instance-attribute
Request rate setting, taken from the config with 5 as the fallback. A semaphore controls the number of concurrent requests to the LLM, preventing rate limit errors: a slot is acquired before each LLM request and released after the request completes.
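The acquire-before, release-after pattern described for the semaphore can be illustrated with the standard library alone. In this sketch, fake_request stands in for a real LLM call, and MAX_CONCURRENT and run_all are hypothetical names chosen for the example; how the class derives its own limit is not shown here.

```python
import asyncio
from typing import List, Tuple

MAX_CONCURRENT = 2  # hypothetical limit used only in this sketch


async def run_all(n_requests: int) -> Tuple[List[str], int]:
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    in_flight = 0
    peak = 0

    async def fake_request(i: int) -> str:
        nonlocal in_flight, peak
        async with semaphore:  # acquire a slot before the request
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for the network call
            in_flight -= 1
        # The slot has been released on leaving the block.
        return f"chunk {i} done"

    results = await asyncio.gather(*(fake_request(i) for i in range(n_requests)))
    return list(results), peak


results, peak = asyncio.run(run_all(5))
print(peak <= MAX_CONCURRENT)  # True
```

All five coroutines are scheduled at once, but the semaphore lets at most MAX_CONCURRENT of them be inside the request section at any moment.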
process_chunks(chunks, api_key)
async
Processes PDF chunks by sending them to the Gemini API and collecting the results.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
chunks | List[Dict[str, Any]] | A list of dictionaries, where each dictionary represents a PDF chunk and contains the PDF data and page numbers. | required |
api_key | str | The Gemini API key. | required |
iterate_pages()
Iterates over the document pages, extracting content using the Gemini API, saves the content to a markdown file, and yields page numbers along with their corresponding content.
Yields:
Type | Description |
---|---|
Tuple[int, Any] | Tuples containing the page number (int) and the page content (Any). |
get_document_from_page(page)
Get a Document object from a given markdown page.
MarkerPdfParser(source, config)
Bases: DocumentParser
Parse PDF files using the marker library: https://github.com/VikParuchuri/marker
iterate_pages()
Yield each page in the PDF using marker.
get_document_from_page(page)
Get Document object from a given 1-page markdown file, possibly containing image refs.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
page | str | The page we get by splitting the large md file from marker. | required |
Returns:
Name | Type | Description |
---|---|---|
Document | Document | Document object, with content and possible metadata. |
find_last_full_char(possible_unicode)
Find the index of the last full character in a byte string.
Args:
possible_unicode (bytes): The bytes to check.
Returns:
int: The index of the last full unicode character.
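When bytes are read in fixed-size pieces, the final UTF-8 character may be cut off. One plausible reading of the task is sketched below: walk back over trailing continuation bytes, find the lead byte, and check whether the character it starts is complete. The function complete_prefix_length is a hypothetical helper written for this example; it returns a prefix length rather than a character index, so its return value may differ from that of find_last_full_char.

```python
def complete_prefix_length(data: bytes) -> int:
    """Return n such that data[:n] ends on a complete UTF-8 character,
    assuming the only damage is a truncated final character."""
    i = len(data) - 1
    steps = 0
    # Walk back over at most 3 trailing continuation bytes (10xxxxxx).
    while i >= 0 and steps < 3 and (data[i] & 0xC0) == 0x80:
        i -= 1
        steps += 1
    if i < 0:
        return 0  # empty input, or nothing but continuation bytes
    lead = data[i]
    # The lead byte announces how many bytes the character needs.
    if lead & 0x80 == 0x00:
        needed = 1
    elif lead & 0xE0 == 0xC0:
        needed = 2
    elif lead & 0xF0 == 0xE0:
        needed = 3
    elif lead & 0xF8 == 0xF0:
        needed = 4
    else:
        needed = 1  # not a valid lead byte; leave it for the decoder
    available = len(data) - i
    return len(data) if available >= needed else i


encoded = "a€".encode("utf-8")  # 1 byte for 'a', 3 bytes for the euro sign
print(complete_prefix_length(encoded))      # 4
print(complete_prefix_length(encoded[:3]))  # 1
```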
is_plain_text(path_or_bytes)
Check if a file is plain text by attempting to decode it as UTF-8.
Args:
path_or_bytes (str|bytes): The file path or bytes object.
Returns:
bool: True if the file is plain text, False otherwise.
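The decode-and-see check can be sketched in a few lines. The function looks_like_plain_text below is a hypothetical, simplified version written for illustration: it reads the whole input and decodes it in one step, which may differ from how is_plain_text samples or handles partial characters.

```python
from typing import Union


def looks_like_plain_text(path_or_bytes: Union[str, bytes]) -> bool:
    """Return True if the content decodes cleanly as UTF-8."""
    if isinstance(path_or_bytes, str):
        # A string is treated as a file path in this sketch.
        with open(path_or_bytes, "rb") as f:
            data = f.read()
    else:
        data = path_or_bytes
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_plain_text("héllo".encode("utf-8")))  # True
print(looks_like_plain_text(b"\xff\xfe\x00"))          # False
```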