repo_loader
langroid/parsing/repo_loader.py
RepoLoaderConfig
Bases: BaseSettings
Configuration for RepoLoader.
RepoLoader(url, config=RepoLoaderConfig())
Class for recursively getting all file content in a repo.
config: configuration for RepoLoader
get_issues(k=100)
Get up to k issues from the GitHub repo.
clone(path=None)
Clone a GitHub repository to the local directory specified by `path`, if it has not already been cloned.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`path` | `str` | The local directory where the repository should be cloned. If not specified, a temporary directory will be created. | `None` |
Returns:
Type | Description |
---|---|
`Optional[str]` | The path to the local directory where the repository was cloned. |
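The "clone only if not already cloned" behaviour can be sketched with the standard library alone. This is a minimal illustration, not langroid's implementation: `clone_once`, its `run` parameter, and the `.git` check are assumptions introduced here, and returning `None` on failure is only one possible reading of the `Optional[str]` return type.

```python
import os
import subprocess
import tempfile
from typing import Callable, Optional


def clone_once(
    url: str,
    path: Optional[str] = None,
    run: Callable[..., object] = subprocess.run,
) -> Optional[str]:
    """Clone `url` into `path` unless a clone already exists there.

    Hypothetical helper: `run` is injectable so the git call can be stubbed.
    """
    if path is None:
        # No target given: fall back to a fresh temporary directory.
        path = tempfile.mkdtemp()
    if os.path.isdir(os.path.join(path, ".git")):
        # Assumption: an existing .git directory means "already cloned".
        return path
    try:
        run(["git", "clone", url, path], check=True)
    except (subprocess.CalledProcessError, OSError):
        return None
    return path
```

Calling `clone_once` a second time with the returned path skips the git call, which is the idempotence the description above implies.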
load_tree_from_github(depth, lines=0)
Get a nested dictionary of GitHub repository file and directory names up to a certain depth, with file contents.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`depth` | `int` | The depth level. | required |
`lines` | `int` | The number of lines of file contents to include. | `0` |
Returns:
Type | Description |
---|---|
`Dict[str, Union[str, List[Dict[str, Any]]]]` | A dictionary containing file and directory names, with file contents. |
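To make the return type concrete, here is a hand-built example of what such a nested dictionary might look like. The key names (`type`, `name`, `dirs`, `files`, `content`) are assumptions chosen to fit the type annotation, and `first_lines` is a hypothetical helper showing how a `lines` limit could truncate file contents.

```python
from typing import Any, Dict, List, Union

Tree = Dict[str, Union[str, List[Dict[str, Any]]]]


def first_lines(text: str, lines: int) -> str:
    """Keep at most `lines` lines of `text`; 0 yields an empty string."""
    return "\n".join(text.splitlines()[:lines])


# Hypothetical tree for a repo with one top-level file and one sub-directory.
# The key names are assumed for illustration only.
example_tree: Tree = {
    "type": "dir",
    "name": "",
    "dirs": [
        {
            "type": "dir",
            "name": "src",
            "dirs": [],
            "files": [
                {
                    "type": "file",
                    "name": "main.py",
                    "content": first_lines("import os\nprint(os.getcwd())\n", 1),
                }
            ],
        }
    ],
    "files": [
        {
            "type": "file",
            "name": "README.md",
            "content": first_lines("# Demo\nSome text\n", 0),
        }
    ],
}
```

Every value is either a string or a list of further dictionaries, which matches the `Dict[str, Union[str, List[Dict[str, Any]]]]` annotation.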
load(path=None, depth=3, lines=0)
From a local folder `path` (if None, the repo clone path), get:
- a nested dictionary (tree) of dicts, files and contents;
- a list of Document objects for each file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`path` | `str` | The local folder path; if None, use self.clone_path() | `None` |
`depth` | `int` | The depth level. | `3` |
`lines` | `int` | The number of lines of file contents to include. | `0` |
Returns:
Type | Description |
---|---|
`Tuple[Dict[str, Union[str, List[Dict[str, Any]]]], List[Document]]` | Tuple of (dict, list of Documents): a dictionary containing file and directory names, with file contents, and a list of Document objects for each file. |
load_from_folder(path, depth=3, lines=0, file_types=None, exclude_dirs=None, url='')
staticmethod
From a local folder `path` (required), get:
- a nested dictionary (tree) of dicts, files and contents, restricting to desired `file_types` and excluding undesired directories;
- a list of Document objects for each file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`path` | `str` | The local folder path, required. | required |
`depth` | `int` | The depth level. Optional, default 3. | `3` |
`lines` | `int` | The number of lines of file contents to include. Optional, default 0 (no lines => empty string). | `0` |
`file_types` | `List[str]` | The file types to include. Optional, default None (all). | `None` |
`exclude_dirs` | `List[str]` | The directories to exclude. Optional, default None (no exclusions). | `None` |
`url` | `str` | Optional url, to be stored in docs as metadata. Default "". | `''` |
Returns:
Type | Description |
---|---|
`Tuple[Dict[str, Union[str, List[Dict[str, Any]]]], List[Document]]` | Tuple of (dict, list of Documents): a dictionary containing file and directory names, with file contents, and a list of Document objects for each file. |
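The tree-building half of this method can be approximated with a short recursive walk. The sketch below is not the library's code: `build_tree` and the dictionary keys are hypothetical, `file_types` is matched against extensions only, and `depth=1` is assumed to mean "files directly under `path`, no sub-directories".

```python
import os
from itertools import islice
from typing import Any, Dict, List, Optional


def build_tree(
    path: str,
    depth: int = 3,
    lines: int = 0,
    file_types: Optional[List[str]] = None,
    exclude_dirs: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Walk `path` and return a nested dict of directories and files."""
    node: Dict[str, Any] = {
        "type": "dir",
        "name": os.path.basename(os.path.normpath(path)),
        "dirs": [],
        "files": [],
    }
    for entry in sorted(os.listdir(path)):
        full = os.path.join(path, entry)
        if os.path.isdir(full):
            # Skip excluded directories; descend only while depth remains.
            if entry not in (exclude_dirs or []) and depth > 1:
                node["dirs"].append(
                    build_tree(full, depth - 1, lines, file_types, exclude_dirs)
                )
        else:
            ext = os.path.splitext(entry)[1].lstrip(".")
            if file_types is not None and ext not in file_types:
                continue
            with open(full, encoding="utf-8", errors="replace") as f:
                # Read at most `lines` lines; lines=0 gives an empty string.
                content = "".join(islice(f, lines))
            node["files"].append(
                {"type": "file", "name": entry, "content": content}
            )
    return node
```

With the default `lines=0` every `content` value is an empty string, matching the "no lines => empty string" note in the parameter table.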
get_documents(path, parser=Parser(ParsingConfig()), file_types=None, exclude_dirs=None, depth=-1, lines=None, doc_type=None)
staticmethod
Recursively get all files under a path as Document objects.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`path` | `str \| bytes` | The path to the directory or file, or bytes content. The bytes option supports the case where the content has already been read from a file in an upstream process (e.g. from an API or a database), so that it need not be written to a temporary file just to be read again, which can be very slow for large files, especially in a Docker container. | required |
`parser` | `Parser` | Parser to use to parse files. | `Parser(ParsingConfig())` |
`file_types` | `List[str]` | List of file extensions, filenames, or file path names to include. Defaults to None, which includes all files. | `None` |
`exclude_dirs` | `List[str]` | List of directories to exclude. Defaults to None, which includes all directories. | `None` |
`depth` | `int` | Max depth of recursion. Defaults to -1, which includes all depths. | `-1` |
`lines` | `int` | Number of lines to read from each file. Defaults to None, which reads all lines. | `None` |
`doc_type` | `str \| DocumentType` | The type of document to parse. | `None` |
Returns:
Type | Description |
---|---|
`List[Document]` | List of Document objects representing files. |
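The `file_types` parameter accepts three kinds of entries: extensions, filenames, and file path names. One way such matching could work is sketched below; `matches_file_types` is a hypothetical helper written for this illustration, and the library's actual matching rules may differ in detail.

```python
import os
from typing import List, Optional


def matches_file_types(file_path: str, file_types: Optional[List[str]]) -> bool:
    """Return True if `file_path` is selected by `file_types`.

    Assumed semantics: an entry matches if it equals the file's extension
    (without the dot), its base name, or the full path as given.
    """
    if file_types is None:
        # None means "include all files".
        return True
    name = os.path.basename(file_path)
    ext = os.path.splitext(name)[1].lstrip(".")
    return ext in file_types or name in file_types or file_path in file_types
```

For example, under these assumptions `["py", "Makefile", "docs/index.md"]` would select every `.py` file, every file named `Makefile`, and the single file at `docs/index.md`.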
load_docs_from_github(k=None, depth=None, lines=None)
Directly from GitHub, recursively get all files in a repo that have one of the extensions, possibly up to a max number of files, max depth, and max number of lines per file (if any of these are specified).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`k` | `int` | Max number of files to load, or None for all files. | `None` |
`depth` | `int` | Max depth to recurse, or None for infinite depth. | `None` |
`lines` | `int` | Max number of lines to get from a file, or None for all lines. | `None` |
Returns:
Type | Description |
---|---|
`List[Document]` | List of Document objects, one per loaded file. |
select(structure, includes, excludes=[])
staticmethod
Filter a structure dictionary for certain directories and files.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`structure` | `Dict[str, Union[str, List[Dict]]]` | The structure dictionary. | required |
`includes` | `List[str]` | A list of desired directories and files. For files, either full file names or a "file type" can be specified. E.g. "toml" will include all files with the ".toml" extension, and "Makefile" will include all files named "Makefile". | required |
`excludes` | `List[str]` | A list of directories and files to exclude. Similar to `includes`. | `[]` |
Returns:
Type | Description |
---|---|
`Dict[str, Union[str, List[Dict[str, Any]]]]` | The filtered structure dictionary. |
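The filtering described above can be illustrated on a plain nested dictionary. This is a sketch under assumptions, not the library's algorithm: `select_sketch` and `_matches` are hypothetical names, the structure is assumed to use `name`, `dirs` and `files` keys, and a directory named in `includes` is assumed to be kept whole.

```python
import os
from typing import Any, Dict, Sequence


def _matches(name: str, patterns: Sequence[str]) -> bool:
    """True if `name`, or its extension without the dot, is in `patterns`."""
    ext = os.path.splitext(name)[1].lstrip(".")
    return name in patterns or (ext != "" and ext in patterns)


def select_sketch(
    structure: Dict[str, Any],
    includes: Sequence[str],
    excludes: Sequence[str] = (),
) -> Dict[str, Any]:
    """Return a filtered copy of an assumed {name, dirs, files} tree."""
    result: Dict[str, Any] = {
        "type": "dir",
        "name": structure["name"],
        "dirs": [],
        "files": [],
    }
    for d in structure["dirs"]:
        if d["name"] in excludes:
            continue
        if d["name"] in includes:
            # Explicitly requested directory: keep it as-is (assumption).
            result["dirs"].append(d)
            continue
        sub = select_sketch(d, includes, excludes)
        if sub["dirs"] or sub["files"]:
            # Keep a directory only if something inside it survived.
            result["dirs"].append(sub)
    for f in structure["files"]:
        if _matches(f["name"], includes) and not _matches(f["name"], excludes):
            result["files"].append(f)
    return result
```

Under these assumptions, `includes=["py", "Makefile"]` with `excludes=["tests"]` keeps every `.py` file and every `Makefile` outside any directory named `tests`.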
ls(structure, depth=0)
staticmethod
Get a list of names of files or directories up to a certain depth from a structure dictionary.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`structure` | `Dict[str, Union[str, List[Dict]]]` | The structure dictionary. | required |
`depth` | `int` | The depth level. Defaults to 0. | `0` |
Returns:
Type | Description |
---|---|
`List[str]` | A list of names of files or directories. |
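A depth-limited listing over a structure dictionary is naturally a breadth-first traversal. The sketch below assumes the `name`/`dirs`/`files` layout and assumes that `depth=0` lists only the entries directly under the root; `ls_sketch` is a hypothetical name and the library's exact depth convention may differ.

```python
from collections import deque
from typing import Any, Dict, List


def ls_sketch(structure: Dict[str, Any], depth: int = 0) -> List[str]:
    """List file and directory names down to `depth` levels below the root."""
    names: List[str] = []
    queue = deque([(structure, 0)])
    while queue:
        node, level = queue.popleft()
        for f in node["files"]:
            names.append(f["name"])
        for d in node["dirs"]:
            names.append(d["name"])
            if level < depth:
                # Only descend while we are above the requested depth.
                queue.append((d, level + 1))
    return names
```

Because the traversal is breadth-first, names closer to the root appear before names from deeper directories.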
list_files(dir, depth=1, include_types=[], exclude_types=[])
staticmethod
Recursively list all files in a directory, up to a certain depth.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`dir` | `str` | The directory path, relative to root. | required |
`depth` | `int` | The depth level. Defaults to 1. | `1` |
`include_types` | `List[str]` | A list of file types to include. Defaults to empty list. | `[]` |
`exclude_types` | `List[str]` | A list of file types to exclude. Defaults to empty list. | `[]` |
Returns:
Type | Description |
---|---|
`List[str]` | A list of file names. |
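A depth-limited recursive listing can be written with `os.walk` by pruning the sub-directory list in place. This is an illustrative sketch rather than the library's code: `list_files_sketch` is a hypothetical name, file types are assumed to be extensions without the dot, an empty `include_types` is assumed to mean "all types", and `depth=1` is assumed to mean "only files directly in `dir`".

```python
import os
from typing import List, Sequence


def list_files_sketch(
    dir: str,
    depth: int = 1,
    include_types: Sequence[str] = (),
    exclude_types: Sequence[str] = (),
) -> List[str]:
    """Recursively list file paths under `dir`, at most `depth` levels deep."""
    root_depth = os.path.normpath(dir).count(os.sep)
    found: List[str] = []
    for current, subdirs, files in os.walk(dir):
        subdirs.sort()  # deterministic traversal order
        level = os.path.normpath(current).count(os.sep) - root_depth + 1
        if level >= depth:
            # Clearing the list in place stops os.walk from descending.
            subdirs[:] = []
        for name in sorted(files):
            ext = os.path.splitext(name)[1].lstrip(".")
            if include_types and ext not in include_types:
                continue
            if ext in exclude_types:
                continue
            found.append(os.path.join(current, name))
    return found
```

Pruning `subdirs` in place relies on `os.walk` running top-down, which is its default.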
show_file_contents(tree)
staticmethod
Print the contents of all files from a structure dictionary.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`tree` | `Dict[str, Union[str, List[Dict]]]` | The structure dictionary. | required |
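Printing every file in a structure dictionary amounts to a recursive walk over its directories. The sketch below assumes the `name`/`dirs`/`files`/`content` layout; `iter_file_contents` and `show_file_contents_sketch` are hypothetical names, and the header format is an arbitrary choice made for this illustration.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_file_contents(
    tree: Dict[str, Any], prefix: str = ""
) -> Iterator[Tuple[str, str]]:
    """Yield (relative path, content) for every file in an assumed tree."""
    for f in tree["files"]:
        yield prefix + f["name"], f["content"]
    for d in tree["dirs"]:
        # Recurse into sub-directories, extending the path prefix.
        yield from iter_file_contents(d, prefix + d["name"] + "/")


def show_file_contents_sketch(tree: Dict[str, Any]) -> None:
    """Print each file's path followed by its content."""
    for path, content in iter_file_contents(tree):
        print(f"--- {path} ---")
        print(content)
```

Separating the generator from the printing keeps the traversal reusable, for example to collect contents into a list instead of writing them to standard output.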