urls
url_to_tempfile(url)
Fetch content from the given URL and save it to a temporary local file.
Parameters:

Name | Type | Description | Default
---|---|---|---
`url` | `str` | The URL of the content to fetch. | *required*

Returns:

Name | Type | Description
---|---|---
`str` | `str` | The path to the temporary file where the content is saved.

Raises:

Type | Description
---|---
`HTTPError` | If there's any issue fetching the content.
Source code in langroid/parsing/urls.py
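The behavior described above can be sketched with the standard library alone. This is a hypothetical minimal version (the real `url_to_tempfile` may use a different HTTP client and error type); `urllib.request.urlopen` raises `HTTPError` on failed requests, which matches the documented contract:

```python
import tempfile
import urllib.request

def url_to_tempfile_sketch(url: str) -> str:
    """Fetch `url` and save the body to a temporary file; return its path."""
    with urllib.request.urlopen(url) as resp:  # raises HTTPError on 4xx/5xx
        data = resp.read()
    # delete=False so the file persists after the handle is closed
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(data)
        return f.name
```

The caller is responsible for deleting the temporary file when done with it.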
get_user_input(msg, color='blue')
Prompt the user for input.

Parameters:
- `msg`: printed prompt
- `color`: color of the prompt (default `'blue'`)

Returns: the user's input.
Source code in langroid/parsing/urls.py
get_list_from_user(prompt="Enter input (type 'done' or hit return to finish)", n=None)
Prompt the user for a list of inputs.

Parameters:
- `prompt`: printed prompt
- `n`: how many inputs to prompt for; if `None`, prompt until done, otherwise stop after `n` inputs.

Returns: list of input strings.
Source code in langroid/parsing/urls.py
get_urls_paths_bytes_indices(inputs)
Given a list of inputs, return a list of indices of URLs, a list of indices of paths, and a list of indices of byte contents.

Parameters:
- `inputs`: list of strings or bytes

Returns: list of indices of URLs, list of indices of paths, list of indices of byte contents.
Source code in langroid/parsing/urls.py
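A classification of this kind might look like the sketch below (an illustrative assumption, not the library's actual implementation): bytes are detected by type, URLs by their scheme, and everything else is treated as a path.

```python
from urllib.parse import urlparse

def get_urls_paths_bytes_indices_sketch(inputs):
    """Classify each input's index as URL, local path, or raw bytes."""
    url_idxs, path_idxs, byte_idxs = [], [], []
    for i, item in enumerate(inputs):
        if isinstance(item, bytes):
            byte_idxs.append(i)
        elif urlparse(item).scheme in ("http", "https"):
            url_idxs.append(i)
        else:
            path_idxs.append(i)
    return url_idxs, path_idxs, byte_idxs
```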
crawl_url(url, max_urls=1)
Crawl starting at the given URL and return a list of URLs to be parsed, up to a maximum of `max_urls`.

This has not been tested to work as intended. Ignore.
Source code in langroid/parsing/urls.py
find_urls(url='https://en.wikipedia.org/wiki/Generative_pre-trained_transformer', max_links=20, visited=None, depth=0, max_depth=2, match_domain=True)
Recursively find all URLs on a given page.
Parameters:

Name | Type | Description | Default
---|---|---|---
`url` | `str` | The URL to start from. | `'https://en.wikipedia.org/wiki/Generative_pre-trained_transformer'`
`max_links` | `int` | The maximum number of links to find. | `20`
`visited` | `set` | A set of URLs that have already been visited. | `None`
`depth` | `int` | The current depth of the recursion. | `0`
`max_depth` | `int` | The maximum depth of the recursion. | `2`
`match_domain` | `bool` | Whether to only return URLs that are on the same domain. | `True`

Returns:

Name | Type | Description
---|---|---
`set` | `Set[str]` | A set of URLs found on the page.
Source code in langroid/parsing/urls.py
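The core of such a crawler is extracting links and applying the `match_domain` filter. The sketch below (a hypothetical helper, not the library's code) shows that step on an already-fetched HTML string, so it omits the network fetch and recursion that `find_urls` performs:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class _LinkCollector(HTMLParser):
    """Gather href values from all <a> tags."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def extract_links(base_url, html, max_links=20, match_domain=True):
    """Return absolute URLs found in `html`, optionally same-domain only."""
    parser = _LinkCollector()
    parser.feed(html)
    base_domain = urlparse(base_url).netloc
    found = set()
    for href in parser.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        if match_domain and urlparse(absolute).netloc != base_domain:
            continue
        found.add(absolute)
        if len(found) >= max_links:
            break
    return found
```

Recursing into each extracted link, with `visited` guarding against revisits and `depth`/`max_depth` bounding the recursion, yields the behavior documented above.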