url_loader
langroid/parsing/url_loader.py
BaseCrawlerConfig
Bases: BaseSettings
Base configuration for web crawlers.
TrafilaturaConfig

Bases: BaseCrawlerConfig

Configuration for the TrafilaturaCrawler.
FirecrawlConfig

Bases: BaseCrawlerConfig

Configuration for the FirecrawlCrawler.
Crawl4aiConfig(**kwargs)
Bases: BaseCrawlerConfig
Configuration for the Crawl4aiCrawler.
BaseCrawler(config)
Bases: ABC
Abstract base class for web crawlers.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | BaseCrawlerConfig | Configuration for the crawler | required |
needs_parser (abstract property)
Indicates whether the crawler requires a parser.
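For orientation, a minimal subclass sketch (the crawler name is hypothetical, and the exact crawl signature is assumed from the concrete crawlers documented below):

```python
from typing import List

from langroid.mytypes import DocMetaData, Document
from langroid.parsing.url_loader import BaseCrawler


class EchoCrawler(BaseCrawler):
    """Hypothetical crawler that returns an empty Document per URL."""

    @property
    def needs_parser(self) -> bool:
        # True would tell the framework to run its DocumentParser
        # on fetched content (e.g. for PDF/DOCX links).
        return False

    def crawl(self, urls: List[str]) -> List[Document]:
        # Stub: a real implementation would fetch and extract text here.
        return [
            Document(content="", metadata=DocMetaData(source=url))
            for url in urls
        ]
```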
CrawlerFactory
Factory for creating web crawlers.
create_crawler(config) (staticmethod)
Create a crawler instance based on configuration type.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | BaseCrawlerConfig | Configuration for the crawler | required |

Returns:

| Type | Description |
|---|---|
| BaseCrawler | A BaseCrawler instance |

Raises:

| Type | Description |
|---|---|
| ValueError | If config type is not supported |
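A short usage sketch of the factory (TrafilaturaConfig is used here on the assumption that it is constructible without credentials):

```python
from langroid.parsing.url_loader import CrawlerFactory, TrafilaturaConfig

# The factory dispatches on the config's type; an unsupported
# config type raises ValueError, per the table above.
crawler = CrawlerFactory.create_crawler(TrafilaturaConfig())
print(type(crawler).__name__)  # -> TrafilaturaCrawler
```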
TrafilaturaCrawler(config)
Bases: BaseCrawler
Crawler implementation using Trafilatura.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | TrafilaturaConfig | Configuration for the crawler | required |
FirecrawlCrawler(config)
Bases: BaseCrawler
Crawler implementation using Firecrawl.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | FirecrawlConfig | Configuration for the crawler | required |
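Because BaseCrawlerConfig derives from BaseSettings, credentials are typically picked up from the environment. A sketch, assuming FirecrawlConfig reads its key from a FIRECRAWL_API_KEY variable (the variable name is an assumption):

```python
import os

from langroid.parsing.url_loader import FirecrawlConfig, FirecrawlCrawler

# Assumption: the API key is read from the environment by the
# BaseSettings-based config; the variable name is illustrative.
os.environ.setdefault("FIRECRAWL_API_KEY", "your-api-key")

crawler = FirecrawlCrawler(FirecrawlConfig())
docs = crawler.crawl(["https://example.com"])
```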
ExaCrawler(config)
Bases: BaseCrawler
Crawler implementation using Exa API.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | ExaCrawlerConfig | Configuration for the crawler | required |
crawl(urls)
Crawl the given URLs using the Exa SDK.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| urls | List[str] | List of URLs to crawl | required |

Returns:

| Type | Description |
|---|---|
| List[Document] | List of Documents with content extracted from the URLs |

Raises:

| Type | Description |
|---|---|
| LangroidImportError | If the exa package is not installed |
| ValueError | If the Exa API key is not set |
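A minimal sketch of calling crawl (assumes the exa package is installed and an API key is available to the config, e.g. via the environment; otherwise the errors listed above are raised):

```python
from langroid.parsing.url_loader import ExaCrawler, ExaCrawlerConfig

crawler = ExaCrawler(ExaCrawlerConfig())
docs = crawler.crawl(["https://example.com"])
for doc in docs:
    # Each Document carries the extracted content and its source URL.
    print(doc.metadata.source, len(doc.content))
```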
Crawl4aiCrawler(config)
Bases: BaseCrawler
Crawler implementation using the crawl4ai library.
This crawler intelligently dispatches URLs. Standard web pages are rendered and scraped using the crawl4ai browser engine. Direct links to documents (PDF, DOCX, etc.) are delegated to the framework's internal DocumentParser.
needs_parser (property)
Indicates that this crawler relies on the framework's DocumentParser for handling specific file types like PDF, DOCX, etc., which the browser engine cannot parse directly.
crawl(urls)
Executes the crawl by separating document URLs from web page URLs.

- Document URLs (.pdf, .docx, etc.) are processed using _process_document.
- Web page URLs are handled using the async crawl4ai engine.
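A sketch of this dispatch behavior (the URLs are illustrative):

```python
from langroid.parsing.url_loader import Crawl4aiConfig, Crawl4aiCrawler

crawler = Crawl4aiCrawler(Crawl4aiConfig())

# The PDF link is routed to the framework's DocumentParser; the
# ordinary page is rendered by the async crawl4ai browser engine.
docs = crawler.crawl([
    "https://example.com/whitepaper.pdf",
    "https://example.com/blog/post",
])
```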
URLLoader(urls, parsing_config=ParsingConfig(), crawler_config=None)
Loads URLs and extracts text using a specified crawler.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| urls | List[Any] | List of URLs to load | required |
| parsing_config | ParsingConfig | Configuration for parsing | ParsingConfig() |
| crawler_config | Optional[BaseCrawlerConfig] | Configuration for the crawler | None |
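Putting it together, a typical end-to-end sketch (this assumes the loader exposes a load() method returning the crawled Documents, which is not shown in the signature above):

```python
from langroid.parsing.url_loader import TrafilaturaConfig, URLLoader

loader = URLLoader(
    urls=["https://example.com"],
    crawler_config=TrafilaturaConfig(),  # None falls back to the default crawler
)
docs = loader.load()  # assumed entry point returning List[Document]
for doc in docs:
    print(doc.metadata.source)
```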