Using marker as a PDF Parser in langroid
Installation
Standard Installation
To use marker as a PDF parser in langroid, install langroid with the marker-pdf extra, e.g. pip install "langroid[marker-pdf]".
Note, however, that marker-pdf has a version incompatibility with docling. If you install langroid using the all extra, or another extra that also includes docling (such as doc-chat or pdf-parsers), e.g. pip install "langroid[all]" or pip install "langroid[doc-chat]", you will get an older version of marker-pdf, which does not work with Langroid.
This may not matter if you do not intend to use marker. If you do want to use marker, install langroid with the marker-pdf extra, in combination with other extras as needed.
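Because the problem described above comes down to which package versions ended up in your environment, it can help to inspect them directly. The following is a minimal sketch using only the Python standard library; the helper names installed_version and report_versions are illustrative and not part of langroid or marker. It only reports versions and makes no claim about which marker-pdf version is recent enough.

```python
from importlib import metadata
from typing import Dict, Optional, Sequence


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report_versions(
    packages: Sequence[str] = ("langroid", "marker-pdf", "docling"),
) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version (or None)."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    # Print one line per package so conflicting installs are easy to spot.
    for name, version in report_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

If both docling and marker-pdf show up as installed, that is a hint that marker-pdf may have been pulled in through an extra that also includes docling.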
For Intel-Mac Users
If you are on an Intel Mac, docling and marker cannot be installed together with langroid as extras, due to a transformers version conflict.
To resolve this, install marker-pdf manually with pip, and make sure to do so within your langroid virtual environment.
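If you are unsure whether this section applies to your machine, a quick check along the following lines may help. This is a rough sketch based on the standard platform module; the function name is_intel_mac is illustrative. One caveat: a Python interpreter running under Rosetta on Apple Silicon can also report x86_64.

```python
import platform


def is_intel_mac(system: str = "", machine: str = "") -> bool:
    """Return True when the platform looks like macOS on an x86_64 CPU.

    Both arguments default to the values reported for the current interpreter;
    they can be overridden, which is mainly useful for testing.
    """
    system = system or platform.system()
    machine = machine or platform.machine()
    return system == "Darwin" and machine == "x86_64"


if __name__ == "__main__":
    if is_intel_mac():
        print("Intel Mac detected: install marker-pdf manually, not as a langroid extra.")
    else:
        print("Not an Intel Mac: the standard installation should apply.")
```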
Example: Parsing a PDF with marker in langroid
import os

from dotenv import load_dotenv
from langroid.parsing.document_parser import DocumentParser
from langroid.parsing.parser import MarkerConfig, ParsingConfig, PdfParsingConfig

# Load environment variables (e.g. from a .env file)
load_dotenv()
gemini_api_key = os.environ.get("GEMINI_API_KEY")

# Path to your PDF file
path = "<path_to_your_pdf_file>"

# Define parsing configuration
parsing_config = ParsingConfig(
    n_neighbor_ids=2,  # Number of neighboring sections to keep
    pdf=PdfParsingConfig(
        library="marker",  # Use `marker` as the PDF parsing library
        marker_config=MarkerConfig(
            config_dict={
                "use_llm": True,  # Enable high-quality LLM processing
                "gemini_api_key": gemini_api_key,  # API key for Gemini LLM
            }
        ),
    ),
)

# Create the parser and extract the document
marker_parser = DocumentParser.create(path, parsing_config)
doc = marker_parser.get_doc()
Explanation of Configuration Options
If you want to use the default configuration, you can omit marker_config entirely.
Key Parameters in MarkerConfig
Parameter | Description |
---|---|
use_llm | Set to True to enable higher-quality processing using LLMs. |
gemini_api_key | Google Gemini API key for LLM-enhanced parsing. |
You can further customize config_dict by referring to the marker-pdf documentation.
Alternatively, the help output of marker's command-line tools lists all supported parameters, which you can pass as needed in config_dict.
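Since config_dict is a plain dictionary, it can be assembled programmatically before being handed to MarkerConfig. The sketch below is one possible approach, not part of langroid: the helper name build_marker_config_dict and the extra_options parameter are assumptions made for illustration. It enables use_llm only when a Gemini API key is actually present, so that a missing key falls back to marker's defaults rather than passing None.

```python
import os
from typing import Any, Dict, Mapping, Optional


def build_marker_config_dict(
    env: Optional[Mapping[str, str]] = None,
    extra_options: Optional[Mapping[str, Any]] = None,
) -> Dict[str, Any]:
    """Assemble a config_dict for marker (illustrative helper).

    - `env` defaults to os.environ; pass a plain dict to test without real keys.
    - `extra_options` holds any further marker options and overrides earlier keys.
    """
    env = os.environ if env is None else env
    config: Dict[str, Any] = {}

    # Turn on LLM-enhanced parsing only if an API key is available.
    api_key = env.get("GEMINI_API_KEY")
    if api_key:
        config["use_llm"] = True
        config["gemini_api_key"] = api_key

    # Merge user-supplied options last so they take precedence.
    if extra_options:
        config.update(extra_options)
    return config


if __name__ == "__main__":
    print(build_marker_config_dict(env={"GEMINI_API_KEY": "dummy-key"}))
```

The resulting dictionary can then be passed as MarkerConfig(config_dict=...), in the same way as the literal dictionary in the example above. When it comes back empty, omitting marker_config entirely gives the default configuration.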