urls
url_to_tempfile(url)
Fetch content from the given URL and save it to a temporary local file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL of the content to fetch. | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | The path to the temporary file where the content is saved. |
Raises:
Type | Description |
---|---|
HTTPError | If there's any issue fetching the content. |
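As an illustration of the fetch-then-save pattern described above, here is a minimal standard-library sketch. It is not the code in langroid/parsing/urls.py: the name `fetch_to_tempfile` and the `suffix` parameter are hypothetical, and it uses `urllib`, so HTTP failures surface as `urllib.error.HTTPError`, which may differ from the `HTTPError` class the documented function raises.

```python
import tempfile
import urllib.request


def fetch_to_tempfile(url: str, suffix: str = "") -> str:
    """Download `url` and return the path of a temp file holding its bytes.

    Hypothetical helper for illustration; not langroid's implementation.
    """
    # urlopen raises urllib.error.HTTPError for HTTP error responses
    with urllib.request.urlopen(url) as response:
        data = response.read()
    # delete=False keeps the file on disk after the handle is closed,
    # so the caller is responsible for removing it later
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(data)
        return tmp.name
```

Because the file is created with `delete=False`, a caller would normally remove it (for example with `os.remove`) once the content has been parsed.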
Source code in langroid/parsing/urls.py
get_user_input(msg, color='blue')
Prompt the user for input.
Args:
msg: printed prompt
color: color of the prompt
Returns:
user input
Source code in langroid/parsing/urls.py
get_list_from_user(prompt="Enter input (type 'done' or hit return to finish)", n=None)
Prompt the user for inputs.
Args:
prompt: printed prompt
n: how many inputs to prompt for. If None, then prompt until done, otherwise quit after n inputs.
Returns:
list of input strings
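The stopping rule described above (stop on 'done', on an empty line, or after n inputs) can be sketched as follows. This is an assumption-laden illustration, not langroid's code: `collect_inputs` and its `read` parameter are hypothetical names, and `read` exists only so the loop can be exercised without a terminal.

```python
from typing import Callable, List, Optional


def collect_inputs(
    prompt: str = "Enter input (type 'done' or hit return to finish)",
    n: Optional[int] = None,
    read: Callable[[str], str] = input,  # hypothetical hook; defaults to built-in input
) -> List[str]:
    """Collect lines until 'done', an empty line, or n inputs have been read."""
    items: List[str] = []
    # With n=None the loop only ends on 'done' or an empty line
    while n is None or len(items) < n:
        line = read(prompt + ": ").strip()
        if line == "" or line.lower() == "done":
            break
        items.append(line)
    return items
```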
Source code in langroid/parsing/urls.py
get_urls_paths_bytes_indices(inputs)
Given a list of inputs, return three lists: the indices of the URLs, the indices of the paths, and the indices of the byte-contents.
Args:
inputs: list of strings or bytes
Returns:
list of indices of URLs, list of indices of paths, list of indices of byte-contents
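To make the three-way split concrete, here is a simplified sketch. The name `split_indices` is hypothetical, and the classification rules are assumptions: a string counts as a URL only if it has an `http`/`https` scheme and a host, and every other string is treated as a path. The library's own validation may be stricter.

```python
from typing import List, Sequence, Tuple, Union
from urllib.parse import urlparse


def split_indices(
    inputs: Sequence[Union[str, bytes]],
) -> Tuple[List[int], List[int], List[int]]:
    """Return (url_indices, path_indices, bytes_indices) for `inputs`."""
    urls: List[int] = []
    paths: List[int] = []
    byte_items: List[int] = []
    for i, item in enumerate(inputs):
        if isinstance(item, bytes):
            byte_items.append(i)
            continue
        parsed = urlparse(item)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            urls.append(i)
        else:
            # Assumption: any remaining string is treated as a path
            paths.append(i)
    return urls, paths, byte_items
```

Returning indices rather than the items themselves lets a caller process each group separately and still put the results back in the original order.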
Source code in langroid/parsing/urls.py
crawl_url(url, max_urls=1)
Crawl starting at the url and return a list of URLs to be parsed, up to a maximum of max_urls.
This has not been tested to work as intended. Ignore.
Source code in langroid/parsing/urls.py
find_urls(url='https://en.wikipedia.org/wiki/Generative_pre-trained_transformer', max_links=20, visited=None, depth=0, max_depth=2, match_domain=True)
Recursively find all URLs on a given page.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL to start from. | 'https://en.wikipedia.org/wiki/Generative_pre-trained_transformer' |
max_links | int | The maximum number of links to find. | 20 |
visited | set | A set of URLs that have already been visited. | None |
depth | int | The current depth of the recursion. | 0 |
max_depth | int | The maximum depth of the recursion. | 2 |
match_domain | bool | Whether to only return URLs that are on the same domain. | True |
Returns:
Name | Type | Description |
---|---|---|
set | Set[str] | A set of URLs found on the page. |
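The parameters above suggest a depth-limited recursive traversal. The sketch below shows one way such a traversal can be written with the standard library. It is not the implementation in langroid/parsing/urls.py: `find_links`, `_LinkParser`, and the `fetch` callable are hypothetical, `fetch` is injected so the example runs without network access, and the returned set here includes the starting URL, which may not match the documented function.

```python
from html.parser import HTMLParser
from typing import Callable, List, Optional, Set
from urllib.parse import urljoin, urlparse


class _LinkParser(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_links(
    url: str,
    fetch: Callable[[str], str],  # hypothetical: returns the HTML for a URL
    max_links: int = 20,
    visited: Optional[Set[str]] = None,
    depth: int = 0,
    max_depth: int = 2,
    match_domain: bool = True,
) -> Set[str]:
    """Recursively collect URLs reachable from `url`, including `url` itself."""
    if visited is None:
        visited = set()
    # Stop on repeats, on exceeding the depth limit, or when enough links are found
    if url in visited or depth > max_depth or len(visited) >= max_links:
        return visited
    visited.add(url)
    parser = _LinkParser()
    parser.feed(fetch(url))
    base_domain = urlparse(url).netloc
    for href in parser.hrefs:
        link = urljoin(url, href)  # resolve relative links against the page URL
        if match_domain and urlparse(link).netloc != base_domain:
            continue
        find_links(link, fetch, max_links, visited, depth + 1, max_depth, match_domain)
    return visited
```

Sharing one `visited` set across the recursive calls both prevents revisiting pages and enforces the `max_links` cap globally rather than per page.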