document_parser
langroid/parsing/document_parser.py
DocumentParser(source, config)
Bases: Parser
Abstract base class for extracting text from special types of docs such as PDFs or Docx.
Attributes:
Name | Type | Description |
---|---|---|
source | str | The source, either a URL or a file path. |
doc_bytes | BytesIO | BytesIO object containing the doc data. |
Args:
source (str|bytes): The source, which can be a path, a URL or a bytes object.
create(source, config, doc_type=None)
classmethod
Create a DocumentParser instance based on source type and config.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
source | str \| bytes | The source, could be a URL, file path, or bytes object. | required |
config | ParserConfig | The parser configuration. | required |
doc_type | str \| None | The type of document, if known. | None |
Returns:
Name | Type | Description |
---|---|---|
DocumentParser | 'DocumentParser' | An instance of a DocumentParser subclass. |
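To make the factory idea concrete, here is a minimal, self-contained sketch of a `create`-style classmethod that picks a subclass from the source. Everything in it is an assumption made for illustration: the names `SketchParser`, `SketchPdfParser`, `SketchDocxParser` and `guess_doc_type` are hypothetical, the `config` argument is omitted, and detecting the type from magic bytes or a file extension is one plausible approach, not necessarily what langroid does.

```python
from typing import Optional, Union


class SketchParser:
    """Hypothetical stand-in for a document parser base class."""

    def __init__(self, source: Union[str, bytes]) -> None:
        self.source = source

    @classmethod
    def create(
        cls, source: Union[str, bytes], doc_type: Optional[str] = None
    ) -> "SketchParser":
        # Use the caller's hint when given; otherwise guess from the source.
        kind = doc_type or guess_doc_type(source)
        if kind == "pdf":
            return SketchPdfParser(source)
        if kind == "docx":
            return SketchDocxParser(source)
        raise ValueError(f"Unsupported document type: {kind}")


class SketchPdfParser(SketchParser):
    """Hypothetical PDF-specific subclass."""


class SketchDocxParser(SketchParser):
    """Hypothetical DOCX-specific subclass."""


def guess_doc_type(source: Union[str, bytes]) -> str:
    """Guess the type from magic bytes (bytes) or the extension (str)."""
    if isinstance(source, bytes):
        if source.startswith(b"%PDF"):
            return "pdf"
        if source.startswith(b"PK"):  # DOCX files are ZIP archives
            return "docx"
        return "unknown"
    lowered = source.lower().split("?")[0]  # drop any URL query string
    if lowered.endswith(".pdf"):
        return "pdf"
    if lowered.endswith(".docx"):
        return "docx"
    return "unknown"
```

An explicit `doc_type` takes precedence over detection in this sketch, which mirrors the "if known" wording of the parameter description above.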
chunks_from_path_or_bytes(source, parser, doc_type=None, lines=None)
staticmethod
Get document chunks from a file path or bytes object.
Args:
source (str|bytes): The source, which could be a URL, path or bytes object.
parser (Parser): The parser instance (for splitting the document).
doc_type (str|DocumentType|None): The type of document, if known.
lines (int|None): The number of lines to read from a plain text file.
Returns:
List[Document]: A list of Document objects, each containing a chunk of text, determined by the chunking and splitting settings in the parser config.
iterate_pages()
extract_text_from_page(page)
fix_text(text)
Fix text extracted from a PDF.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | The extracted text. | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | The fixed text. |
get_doc()
Get entire text from source as a single document.
Returns:
Type | Description |
---|---|
Document | A Document object containing the entire text. |
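The relationship between `iterate_pages()`, `extract_text_from_page(page)` and `get_doc()` can be pictured as a template method: the base class walks the pages, and each subclass supplies the page-level hooks. The sketch below is a self-contained illustration under that assumption. `PagedParserSketch`, `ListParserSketch` and `get_doc_text` are hypothetical names, and the method returns a plain string rather than a Document object.

```python
from typing import Any, Generator, List, Tuple


class PagedParserSketch:
    """Hypothetical base class: subclasses supply the page-level hooks."""

    def iterate_pages(self) -> Generator[Tuple[int, Any], None, None]:
        raise NotImplementedError

    def extract_text_from_page(self, page: Any) -> str:
        raise NotImplementedError

    def get_doc_text(self) -> str:
        # Concatenate the text of every page, in order.
        return "".join(
            self.extract_text_from_page(page) for _, page in self.iterate_pages()
        )


class ListParserSketch(PagedParserSketch):
    """Toy subclass whose 'pages' are plain strings."""

    def __init__(self, pages: List[str]) -> None:
        self.pages = pages

    def iterate_pages(self) -> Generator[Tuple[int, str], None, None]:
        # Yield (index, page) tuples, matching the generator type used on this page.
        yield from enumerate(self.pages)

    def extract_text_from_page(self, page: str) -> str:
        return page.strip() + "\n"
```

With this split, supporting a new backend only requires writing the two hooks; the whole-document logic stays in one place.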
get_doc_chunks()
Get document chunks from a PDF source, with page references in the document metadata.
Adapted from https://github.com/whitead/paper-qa/blob/main/paperqa/readers.py
Returns:
Type | Description |
---|---|
List[Document] | A list of Document objects. |
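As a rough illustration of chunking with page references, the sketch below accumulates page text in a buffer, cuts fixed-size chunks from it, and records which pages each chunk spans. It is a simplification under stated assumptions: `chunk_pages` and `_page_range` are hypothetical helpers, chunk size is measured in characters, and the page reference is returned as a string next to the text instead of being stored in Document metadata.

```python
from typing import Iterable, List, Tuple


def chunk_pages(
    pages: Iterable[Tuple[int, str]], chunk_size: int
) -> List[Tuple[str, str]]:
    """Split page texts into chunks of at most chunk_size characters.

    Returns (chunk_text, page_range) pairs, where page_range is e.g. "2-3".
    """
    chunks: List[Tuple[str, str]] = []
    buffer = ""
    buffer_pages: List[int] = []
    for page_num, text in pages:
        buffer += text
        buffer_pages.append(page_num)
        while len(buffer) >= chunk_size:
            chunks.append((buffer[:chunk_size], _page_range(buffer_pages)))
            buffer = buffer[chunk_size:]
            # Any leftover text comes from the current page only.
            buffer_pages = [page_num] if buffer else []
    if buffer:
        chunks.append((buffer, _page_range(buffer_pages)))
    return chunks


def _page_range(pages: List[int]) -> str:
    """Format the first and last page of a chunk as "n" or "n-m"."""
    return str(pages[0]) if pages[0] == pages[-1] else f"{pages[0]}-{pages[-1]}"
```

For example, `chunk_pages([(1, "aaaa"), (2, "bb"), (3, "cccc")], 5)` produces two chunks, the first spanning pages 1-2 and the second spanning pages 2-3.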
FitzPDFParser(source, config)
Bases: DocumentParser
Parser for processing PDFs using the fitz library.
Args:
source (str|bytes): The source, which can be a path, a URL or a bytes object.
iterate_pages()
Yield each page in the PDF using fitz.
Returns:
Type | Description |
---|---|
Generator[Tuple[int, 'fitz.Page'], None, None] | Generator yielding an (int, fitz.Page) tuple for each page. |
extract_text_from_page(page)
Extract text from a given fitz page.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
page | Page | The fitz page object. | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | Extracted text from the page. |
PyPDFParser(source, config)
Bases: DocumentParser
Parser for processing PDFs using the pypdf library.
Args:
source (str|bytes): The source, which can be a path, a URL or a bytes object.
iterate_pages()
Yield each page in the PDF using pypdf.
Returns:
Type | Description |
---|---|
Generator[Tuple[int, PageObject], None, None] | Generator yielding an (int, PageObject) tuple for each page. |
extract_text_from_page(page)
Extract text from a given pypdf page.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
page | PageObject | The pypdf page object. | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | Extracted text from the page. |
PDFPlumberParser(source, config)
Bases: DocumentParser
Parser for processing PDFs using the pdfplumber library.
Args:
source (str|bytes): The source, which can be a path, a URL or a bytes object.
iterate_pages()
Yield each page in the PDF using pdfplumber.
Returns:
Type | Description |
---|---|
Generator[Tuple[int, Page], None, None] | Generator yielding an (int, pdfplumber Page) tuple for each page. |
extract_text_from_page(page)
Extract text from a given pdfplumber page.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
page | Page | The pdfplumber page object. | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | Extracted text from the page. |
ImagePdfParser(source, config)
Bases: DocumentParser
Parser for processing PDFs that are images, i.e. not "true" PDFs.
Args:
source (str|bytes): The source, which can be a path, a URL or a bytes object.
extract_text_from_page(page)
Extract text from a given pdf2image page.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
page | Image | The PIL Image object. | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | Extracted text from the image. |
UnstructuredPDFParser(source, config)
Bases: DocumentParser
Parser for processing PDF files using the unstructured library.
Args:
source (str|bytes): The source, which can be a path, a URL or a bytes object.
extract_text_from_page(page)
Extract text from a given unstructured element.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
page | unstructured element | The unstructured element object. | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | Extracted text from the element. |
UnstructuredDocxParser(source, config)
Bases: DocumentParser
Parser for processing DOCX files using the unstructured library.
Args:
source (str|bytes): The source, which can be a path, a URL or a bytes object.
extract_text_from_page(page)
Extract text from a given unstructured element.
Note
The concept of "pages" doesn't actually exist in the .docx file format in the same way it does in formats like .pdf. A .docx file is made up of a series of elements like paragraphs and tables, but the division into pages is done dynamically based on the rendering settings (like the page size, margin size, font size, etc.).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
page | unstructured element | The unstructured element object. | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | Extracted text from the element. |
PythonDocxParser(source, config)
Bases: DocumentParser
Parser for processing DOCX files using the python-docx library.
Args:
source (str|bytes): The source, which can be a path, a URL or a bytes object.
iterate_pages()
Simulate iterating through pages. In a DOCX file, pages are not explicitly defined, so we consider each paragraph as a separate 'page' for simplicity.
extract_text_from_page(page)
Extract text from a given 'page', which in this case is a single paragraph.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
page | list | A list containing a single Paragraph object. | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | Extracted text from the paragraph. |
find_last_full_char(possible_unicode)
Find the index of the last full character in a byte string.
Args:
possible_unicode (bytes): The bytes to check.
Returns:
int: The index of the last full Unicode character.
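One way to locate a character boundary in UTF-8 bytes is to scan backwards for the first byte that is not a continuation byte (continuation bytes have the bit pattern `10xxxxxx`). The sketch below shows that idea and uses it to drop a character that was cut off mid-sequence. `last_char_start` and `decode_prefix` are hypothetical names, and whether langroid's function uses exactly this rule or return convention is an assumption.

```python
def last_char_start(data: bytes) -> int:
    """Return the index where the last UTF-8 character in data begins."""
    for i in range(len(data) - 1, -1, -1):
        # Continuation bytes look like 0b10xxxxxx; any other byte starts a character.
        if (data[i] & 0xC0) != 0x80:
            return i
    return 0


def decode_prefix(data: bytes) -> str:
    """Decode data, dropping a trailing character that was cut mid-sequence.

    Invalid bytes elsewhere in data still raise UnicodeDecodeError.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data[: last_char_start(data)].decode("utf-8")
```

This kind of helper matters when only the first N bytes of a file are read, since the cut can land in the middle of a multi-byte character.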
is_plain_text(path_or_bytes)
Check if a file is plain text by attempting to decode it as UTF-8.
Args:
path_or_bytes (str|bytes): The file path or bytes object.
Returns:
bool: True if the file is plain text, False otherwise.
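The decode-as-UTF-8 check can be sketched with the standard library alone. The version below samples only the start of the input and uses an incremental decoder so that a multi-byte character cut off at the sample boundary does not cause a false negative. The name `looks_like_plain_text` and the `sample_size` parameter are assumptions for illustration, not part of the documented signature.

```python
import codecs
from typing import Union


def looks_like_plain_text(
    path_or_bytes: Union[str, bytes], sample_size: int = 1024
) -> bool:
    """Return True if the first sample_size bytes decode cleanly as UTF-8."""
    if isinstance(path_or_bytes, str):
        with open(path_or_bytes, "rb") as f:
            raw = f.read(sample_size + 1)
    else:
        raw = path_or_bytes[: sample_size + 1]
    # One extra byte was requested to learn whether the sample is a truncation.
    truncated = len(raw) > sample_size
    sample = raw[:sample_size]
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # With final=False, an incomplete multi-byte sequence at the very end
        # of the sample is tolerated; invalid bytes still raise.
        decoder.decode(sample, final=not truncated)
    except UnicodeDecodeError:
        return False
    return True
```

Note that a successful UTF-8 decode is a heuristic: some binary data happens to be valid UTF-8, so callers may want additional checks for their use case.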