url_loader

langroid/parsing/url_loader.py

URLLoader(urls, parser=Parser(ParsingConfig()))

Load a list of URLs and extract the text content. Alternative approaches could use bs4 or scrapy.

TODO: cookie dialogs are not currently handled. If a page shows a cookie pop-up, most or all of the extracted content could be cookie policy text. We could use playwright to simulate a user clicking the "accept" button on the cookie dialog.
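To illustrate what "extract the text content" can involve, and why a cookie pop-up pollutes the result, here is a minimal sketch using only the Python standard library. It is not Langroid's implementation: `VisibleTextExtractor`, `extract_text`, and `SAMPLE_HTML` are hypothetical names made up for this example, and the HTML is an inline string so that nothing is fetched over the network.

```python
from html.parser import HTMLParser
from typing import List


class VisibleTextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping <script>/<style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0          # > 0 while inside a skipped tag
        self.chunks: List[str] = []   # non-empty text fragments, in order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    """Return the text fragments of an HTML string, one per line."""
    extractor = VisibleTextExtractor()
    extractor.feed(html)
    extractor.close()
    return "\n".join(extractor.chunks)


# Assumed sample page: a cookie banner sits next to the real content.
SAMPLE_HTML = """
<html><head><style>p { color: red; }</style></head>
<body>
  <div id="cookie-banner">We use cookies. <button>Accept</button></div>
  <p>Actual article text.</p>
  <script>console.log("ignored");</script>
</body></html>
"""

text = extract_text(SAMPLE_HTML)
print(text)
```

The extractor has no way to tell the banner from the article, so the cookie text ends up in the output alongside the real content. On a page where the dialog covers or replaces the main content, the banner text could be most of what is extracted, which is the situation the TODO describes.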

Source code in langroid/parsing/url_loader.py
def __init__(self, urls: List[str], parser: Parser = Parser(ParsingConfig())):
    self.urls = urls
    self.parser = parser
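One Python detail in the signature above: a default argument expression such as `Parser(ParsingConfig())` is evaluated once, when the function is defined, so every `URLLoader` created without an explicit `parser` receives the same `Parser` object. Whether that matters depends on whether `Parser` holds mutable state, which this page does not say. The sketch below shows the behavior with stand-in classes; `StubParser`, `SharedDefaultLoader`, and `FreshDefaultLoader` are hypothetical and are not part of Langroid.

```python
from typing import List, Optional


class StubParser:
    """Hypothetical stand-in for Parser; not the Langroid class."""


class SharedDefaultLoader:
    # Same pattern as the signature above: the default is built once.
    def __init__(self, urls: List[str], parser: StubParser = StubParser()):
        self.urls = urls
        self.parser = parser


class FreshDefaultLoader:
    # Common alternative: use None as a sentinel and build per instance.
    def __init__(self, urls: List[str], parser: Optional[StubParser] = None):
        self.urls = urls
        self.parser = parser if parser is not None else StubParser()


a = SharedDefaultLoader(["https://example.com/a"])
b = SharedDefaultLoader(["https://example.com/b"])
shared = a.parser is b.parser      # both instances hold one parser object

c = FreshDefaultLoader(["https://example.com/a"])
d = FreshDefaultLoader(["https://example.com/b"])
fresh = c.parser is not d.parser   # each instance gets its own parser

print(shared, fresh)
```

Passing a parser explicitly avoids the question in either design.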