# Firecrawl, Exa, and Trafilatura Crawlers Documentation
`URLLoader` uses the `TrafilaturaCrawler` by default when no crawler is explicitly specified.
## Overview
- **FirecrawlCrawler**: Leverages the Firecrawl API for efficient web scraping and crawling. It offers built-in document processing and produces non-chunked markdown output from web-page content. Requires the `FIRECRAWL_API_KEY` environment variable to be set in a `.env` file or in the environment.
- **TrafilaturaCrawler**: Uses the Trafilatura library and Langroid's parsing tools to extract and process web content. This is the default crawler, and it does not require an external API key. Produces chunked markdown output from web-page content.
- **ExaCrawler**: Integrates with the Exa API for high-quality content extraction. Requires the `EXA_API_KEY` environment variable to be set in a `.env` file or in the environment. Also produces chunked markdown output from web-page content.
## Installation
`TrafilaturaCrawler` ships with Langroid, so no extra installation is needed.

To use `FirecrawlCrawler`, install the `firecrawl` extra:
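```bash
pip install "langroid[firecrawl]"
```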
## Exa Crawler Documentation
### Overview
`ExaCrawler` integrates with the Exa API to extract high-quality content from web pages.
It provides efficient content extraction with the simplicity of API-based processing.
### Parameters
Obtain an Exa API key from Exa and set it in your environment variables,
e.g. in your `.env` file as:
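```
EXA_API_KEY=your-exa-api-key
```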
- `config` (ExaCrawlerConfig): An `ExaCrawlerConfig` object.
- `api_key` (str): Your Exa API key.
### Usage
```python
from langroid.parsing.url_loader import URLLoader, ExaCrawlerConfig

# Create an ExaCrawlerConfig object
exa_config = ExaCrawlerConfig(
    # Typically omitted here, as it is loaded from the EXA_API_KEY
    # environment variable (e.g., via your .env file)
    api_key="your-exa-api-key",
)

loader = URLLoader(
    urls=[
        "https://pytorch.org",
        "https://www.tensorflow.org",
    ],
    crawler_config=exa_config,
)

docs = loader.load()
print(docs)
```
### Benefits
- Simple API integration requiring minimal configuration.
- Efficient handling of complex web pages.
- For plain-HTML content, the Exa API produces high-quality extraction with clean output that retains HTML tags, which Langroid then converts to markdown using the `markdownify` library (illustrated below).
- For "document" content (e.g., `pdf`, `doc`, `docx`), the content is downloaded via the Exa API, and Langroid's document-processing tools produce chunked output in a format controlled by the `Parser` configuration (defaults to markdown in most cases).
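To make the HTML-to-markdown step concrete, here is roughly what the `markdownify` conversion looks like in isolation. This is a minimal sketch; Langroid performs this step internally, so you do not need to call it yourself:

```python
from markdownify import markdownify as md

html = "<h1>Results</h1><p>Some <b>bold</b> text and a <a href='https://example.com'>link</a>.</p>"

# heading_style="ATX" produces "#"-style headings instead of underlined ones
print(md(html, heading_style="ATX"))
# Expected output (roughly):
# # Results
#
# Some **bold** text and a [link](https://example.com).
```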
## Trafilatura Crawler Documentation
### Overview
`TrafilaturaCrawler` is a web crawler that uses the Trafilatura library for content extraction,
and Langroid's parsing capabilities for further processing.
### Parameters
- `config` (TrafilaturaConfig): A `TrafilaturaConfig` object that specifies parameters related to scraping or output format.
    - `threads` (int): The number of threads to use for downloading web pages.
    - `format` (str): One of `"markdown"` (default), `"xml"`, or `"txt"`; in the case of `"xml"`, the output is actually in HTML format.
Similar to the `ExaCrawler`, the `TrafilaturaCrawler` works differently depending on
the type of web-page content:

- For "document" content (e.g., `pdf`, `doc`, `docx`), the content is downloaded
  and parsed with Langroid's document-processing tools to produce chunked output
  in a format controlled by the `Parser` configuration (defaults to markdown in most cases).
- For plain-HTML content, the output depends on the `format` parameter
  (see the sketch after this list):
    - If `format` is `"markdown"` (default), the library extracts content in
      markdown format, and the final output is a list of chunked markdown documents.
    - If `format` is `"xml"`, content is extracted in HTML format, which
      Langroid then converts to markdown using the `markdownify` library; the final
      output is a list of chunked markdown documents.
    - If `format` is `"txt"`, content is extracted in plain-text format, and the
      final output is a list of plain-text documents.
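As a quick sketch of how these options map onto configuration, assuming the `format` field accepts exactly the string values listed above:

```python
from langroid.parsing.url_loader import TrafilaturaConfig

# Direct markdown extraction (default): chunked markdown documents
config_md = TrafilaturaConfig(threads=4, format="markdown")

# HTML extraction, converted to markdown via markdownify: chunked markdown documents
config_xml = TrafilaturaConfig(threads=4, format="xml")

# Plain-text extraction: plain-text documents, no markdown conversion
config_txt = TrafilaturaConfig(threads=4, format="txt")
```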
### Usage
```python
from langroid.parsing.url_loader import URLLoader, TrafilaturaConfig

# Create a TrafilaturaConfig instance
trafilatura_config = TrafilaturaConfig(threads=4)

loader = URLLoader(
    urls=[
        "https://pytorch.org",
        "https://www.tensorflow.org",
        "https://ai.google.dev/gemini-api/docs",
        "https://books.toscrape.com/",
    ],
    crawler_config=trafilatura_config,
)

docs = loader.load()
print(docs)
```
### Langroid Parser Integration
`TrafilaturaCrawler` relies on a Langroid `Parser` to handle document processing.
The `Parser` uses the default parsing methods, or a configuration that
can be adjusted to suit the current use case.
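A minimal sketch of passing a custom parsing configuration, assuming `URLLoader` accepts a `parsing_config` argument and that `ParsingConfig` exposes a `chunk_size` field; check `langroid.parsing.parser.ParsingConfig` for the actual options in your Langroid version:

```python
from langroid.parsing.parser import ParsingConfig
from langroid.parsing.url_loader import URLLoader, TrafilaturaConfig

# Hypothetical illustration: smaller chunks for the document-processing path
parsing_config = ParsingConfig(chunk_size=800)

loader = URLLoader(
    urls=["https://books.toscrape.com/"],
    parsing_config=parsing_config,  # assumption: parameter name may differ by version
    crawler_config=TrafilaturaConfig(threads=4),
)
docs = loader.load()
```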
## Firecrawl Crawler Documentation
### Overview
`FirecrawlCrawler` is a web-crawling utility class that uses the Firecrawl API
to scrape or crawl web pages efficiently. It offers two modes:

- **Scrape Mode** (default): Extracts content from a list of specified URLs.
- **Crawl Mode**: Recursively follows links from a starting URL,
  gathering content from multiple pages, including subdomains, while bypassing blockers.

**Note:** `crawl` mode accepts only ONE URL, passed as a single-element list.
### Parameters
Obtain a Firecrawl API key from Firecrawl and set it in
your environment variables, e.g. in your `.env` file as:
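```
FIRECRAWL_API_KEY=your-firecrawl-api-key
```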
- `config` (FirecrawlConfig): A `FirecrawlConfig` object.
- `timeout` (int, optional): Time in milliseconds (ms) to wait for a response.
  Default is `30000` ms (30 seconds). In crawl mode, this applies per URL.
- `limit` (int, optional): Maximum number of pages to scrape in crawl mode. Helps control API usage.
- `params` (dict, optional): Additional parameters to customize the request. See the scrape API and crawl API for details.
### Usage
#### Scrape Mode (Default)
Fetch content from multiple URLs:
```python
from langroid.parsing.url_loader import URLLoader, FirecrawlConfig

# Create a FirecrawlConfig object
firecrawl_config = FirecrawlConfig(
    # Typical/best practice is to omit api_key here; Pydantic BaseSettings
    # loads it from the FIRECRAWL_API_KEY environment variable in your .env file
    api_key="your-firecrawl-api-key",
    timeout=15000,  # Timeout per request (15 sec)
    mode="scrape",
)

loader = URLLoader(
    urls=[
        "https://pytorch.org",
        "https://www.tensorflow.org",
        "https://ai.google.dev/gemini-api/docs",
        "https://books.toscrape.com/",
    ],
    crawler_config=firecrawl_config,
)

docs = loader.load()
print(docs)
```
#### Crawl Mode
Fetch content from multiple pages starting from a single URL:
```python
from langroid.parsing.url_loader import URLLoader, FirecrawlConfig

# Create a FirecrawlConfig object
firecrawl_config = FirecrawlConfig(
    timeout=30000,  # 30 sec per page
    mode="crawl",
    params={
        "limit": 5,  # Cap the number of pages crawled
    },
)

loader = URLLoader(
    urls=["https://books.toscrape.com/"],
    crawler_config=firecrawl_config,
)

docs = loader.load()
print(docs)
```
### Output
Results are stored in the `firecrawl_output` directory.
### Best Practices
- Set `limit` in crawl mode to avoid excessive API usage.
- Adjust `timeout` based on network conditions and website responsiveness.
- Use `params` to customize scraping behavior based on Firecrawl API capabilities, as sketched below.
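For example, a scrape-mode configuration that passes extra options through `params` might look like the following sketch. The specific keys (`formats`, `onlyMainContent`) are assumptions based on Firecrawl's scrape API; consult the Firecrawl docs for the options supported by your API version:

```python
from langroid.parsing.url_loader import FirecrawlConfig

firecrawl_config = FirecrawlConfig(
    mode="scrape",
    timeout=15000,
    params={
        # Assumed Firecrawl scrape-API options -- verify against the Firecrawl docs
        "formats": ["markdown"],   # request markdown output directly
        "onlyMainContent": True,   # skip navigation, footers, etc.
    },
)
```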
### Firecrawl's Built-In Document Processing
`FirecrawlCrawler` benefits from Firecrawl's built-in document processing,
which automatically extracts and structures content from web pages (including `pdf`, `doc`, `docx`).
This reduces the need for complex parsing logic within Langroid.
Unlike the Exa and Trafilatura crawlers, the resulting documents are
non-chunked markdown documents.
## Choosing a Crawler
- Use `FirecrawlCrawler` when you need efficient, API-driven scraping with built-in document processing. This is often the simplest and most effective choice, but incurs a cost since the API is paid.
- Use `TrafilaturaCrawler` when you want local, non-API-based scraping (less accurate).
- Use `ExaCrawler` as a middle ground between the two: it provides high-quality content extraction for plain-HTML content, while relying on Langroid's document-processing tools for document content. This costs significantly less than Firecrawl.
## Example script
See the script `examples/docqa/chat_search.py`,
which shows how to use a Langroid agent to search the web and scrape URLs to answer questions.