spider
DomainSpecificSpider(start_url, k=20, *args, **kwargs)
Bases: CrawlSpider
Parameters:
Name | Type | Description | Default |
---|---|---|---|
start_url | str | The starting URL. | required |
k | int | The maximum number of URLs to collect. Defaults to 20. | 20 |
Source code in langroid/parsing/spider.py
parse_item(response)
Extracts URLs that are within the same domain.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
response | Response | The Scrapy response object. | required |
Source code in langroid/parsing/spider.py
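To illustrate what "within the same domain" can mean in practice, here is a minimal standard-library sketch. It is not the langroid implementation: the helper name `same_domain_links` is hypothetical, and it assumes "same domain" means an exact hostname match, which the documentation above does not specify.

```python
from typing import Iterable, List
from urllib.parse import urljoin, urlparse


def same_domain_links(base_url: str, hrefs: Iterable[str]) -> List[str]:
    """Resolve hrefs against base_url and keep those on the same host.

    Hypothetical helper for illustration; assumes an exact hostname match.
    """
    base_host = urlparse(base_url).hostname
    kept: List[str] = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        # Keep only web links whose host equals the base page's host.
        if parsed.scheme in ("http", "https") and parsed.hostname == base_host:
            if absolute not in kept:  # drop duplicates, preserve order
                kept.append(absolute)
    return kept


links = same_domain_links(
    "https://example.com/docs/",
    ["intro.html", "/about", "https://other.org/x", "mailto:a@example.com", "/about"],
)
print(links)
```

Under this assumption a subdomain such as `blog.example.com` would be treated as a different domain; a real crawler may choose a looser rule.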
scrapy_fetch_urls(url, k=20)
Fetches up to k URLs reachable from the input URL using Scrapy.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The starting URL. | required |
k | int | The maximum number of URLs to return. Defaults to 20. | 20 |
Returns:
Type | Description |
---|---|
List[str] | List of URLs within the same domain as the input URL. |
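The contract described above (at most `k` same-domain URLs reachable from a start URL) can be sketched without any network access. The example below is only an illustration under stated assumptions: `LINK_GRAPH` and `reachable_urls` are hypothetical names, the "site" is an in-memory dictionary, the start URL is counted among the results, and traversal is breadth-first for simplicity. Scrapy's actual scheduling order and langroid's actual result set may differ.

```python
from collections import deque
from typing import Dict, List
from urllib.parse import urlparse

# Hypothetical in-memory "site": page URL -> links found on that page.
LINK_GRAPH: Dict[str, List[str]] = {
    "https://example.com/": [
        "https://example.com/a",
        "https://example.com/b",
        "https://other.org/",
    ],
    "https://example.com/a": ["https://example.com/c", "https://example.com/"],
    "https://example.com/b": ["https://example.com/c"],
    "https://example.com/c": [],
}


def reachable_urls(start_url: str, k: int = 20) -> List[str]:
    """Collect up to k same-host URLs reachable from start_url (breadth-first)."""
    host = urlparse(start_url).hostname
    seen = {start_url}
    found: List[str] = []
    queue = deque([start_url])
    while queue and len(found) < k:
        page = queue.popleft()
        found.append(page)
        for link in LINK_GRAPH.get(page, []):
            # Skip pages already queued and links that leave the start host.
            if link not in seen and urlparse(link).hostname == host:
                seen.add(link)
                queue.append(link)
    return found


first_three = reachable_urls("https://example.com/", k=3)
everything = reachable_urls("https://example.com/")
print(first_three)
print(everything)
```

With langroid installed, the real call would presumably be `scrapy_fetch_urls("https://example.com", k=5)`, imported from `langroid.parsing.spider` (an import path inferred from the source location shown above), returning a `List[str]`.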