utils
batched(iterable, n)
Batch data into tuples of length n. The last batch may be shorter.
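The batching behavior described above can be sketched with the standard library. This is a minimal illustration, not the code in langroid/parsing/utils.py; the name batched_sketch and the ValueError for n < 1 are assumptions made for the example.

```python
from itertools import islice
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def batched_sketch(iterable: Iterable[T], n: int) -> Iterator[Tuple[T, ...]]:
    """Yield tuples of up to n items; the final tuple may be shorter."""
    if n < 1:
        raise ValueError("n must be at least 1")  # assumption for this sketch
    it = iter(iterable)
    # Pull n items at a time until the iterator is exhausted.
    while batch := tuple(islice(it, n)):
        yield batch


print(list(batched_sketch("ABCDEFG", 3)))
# [('A', 'B', 'C'), ('D', 'E', 'F'), ('G',)]
```

Because the sketch is a generator, it never materializes more than one batch at a time, which matters when the input is a large stream.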
Source code in langroid/parsing/utils.py
closest_string(query, string_list)
Find the closest match to the query in a list of strings.
This function is case-insensitive and ignores leading and trailing whitespace. If no match is found, it returns 'No match found'.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | str | The string to match. | required |
string_list | List[str] | The list of strings to search. | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | The closest match to the query from the list, or 'No match found' if no match is found. |
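One way to get case-insensitive, whitespace-tolerant matching is to normalize both sides and hand the comparison to difflib. The sketch below is an assumption about how such a function could work, not a copy of the library code: the name closest_string_sketch, the use of difflib.get_close_matches, and its default similarity cutoff are all choices made for illustration.

```python
import difflib
from typing import List


def closest_string_sketch(query: str, string_list: List[str]) -> str:
    """Return the entry closest to query, ignoring case and outer whitespace."""
    # Map each normalized form back to the original entry.
    normalized = {s.strip().lower(): s for s in string_list}
    matches = difflib.get_close_matches(
        query.strip().lower(), list(normalized), n=1
    )
    # Fall back to the sentinel string described in the documentation.
    return normalized[matches[0]] if matches else "No match found"


print(closest_string_sketch("  APPLE ", ["apple", "banana", "cherry"]))  # apple
print(closest_string_sketch("xyz", ["apple", "banana", "cherry"]))  # No match found
```

Callers should compare the result against 'No match found' explicitly, since the sentinel is an ordinary string rather than None.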
split_paragraphs(text)
Split the input text into paragraphs using "\n\n" (a blank line) as the delimiter.
Args:
text (str): The input text.
Returns:
list: A list of paragraphs.
split_newlines(text)
Split the input text into lines using "\n" as the delimiter.
Args:
text (str): The input text.
Returns:
list: A list of lines.
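The difference between split_paragraphs and split_newlines is easiest to see side by side. The sketch below assumes paragraphs are separated by a blank line and lines by a single newline; stripping each piece and dropping empty ones are assumptions of this example, and the _sketch names are hypothetical.

```python
import re
from typing import List


def split_paragraphs_sketch(text: str) -> List[str]:
    """Split on blank lines and drop empty pieces."""
    parts = re.split(r"\n[ \t]*\n", text)
    return [p.strip() for p in parts if p.strip()]


def split_newlines_sketch(text: str) -> List[str]:
    """Split on single newlines and drop empty pieces."""
    return [line.strip() for line in text.split("\n") if line.strip()]


sample = "First para, line one.\nFirst para, line two.\n\nSecond para."
print(split_paragraphs_sketch(sample))
# ['First para, line one.\nFirst para, line two.', 'Second para.']
print(split_newlines_sketch(sample))
# ['First para, line one.', 'First para, line two.', 'Second para.']
```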
number_segments(s, granularity=1)
Number the segments in a given text, preserving paragraph structure.
A segment is a sequence of granularity consecutive "sentences", where a "sentence" is either a normal sentence or, when there is too little punctuation to identify sentences reliably, a pseudo-sentence found by heuristics (split on newlines or, failing that, every 40 words). The goal is simply to number segments at a reasonable granularity so that the LLM can identify relevant segments in the RelevanceExtractorAgent.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
s | str | The input text. | required |
granularity | int | The number of sentences in a segment. If this is -1, then the entire text is treated as a single segment, and is numbered as <#1#>. | 1 |
Returns:
Name | Type | Description |
---|---|---|
str | str | The text with segments numbered in the style <#1#>, <#2#> etc. |
Example
>>> number_segments("Hello world! How are you? Have a good day.")
'<#1#> Hello world! <#2#> How are you? <#3#> Have a good day.'
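To make the numbering scheme concrete, here is a simplified sketch. It is not the library's implementation: it uses a naive punctuation-based sentence splitter and leaves out the pseudo-sentence heuristics (newline and 40-word fallbacks) described above. The name number_segments_sketch and the ValueError for unsupported granularity values are assumptions.

```python
import re


def number_segments_sketch(s: str, granularity: int = 1) -> str:
    """Prefix each run of `granularity` sentences with a <#k#> marker."""
    if granularity == -1:
        # Treat the entire text as a single segment.
        return "<#1#> " + s
    if granularity < 1:
        raise ValueError("granularity must be -1 or a positive integer")

    numbered_paragraphs = []
    count = 0  # segment numbers keep increasing across paragraphs
    for para in re.split(r"\n[ \t]*\n", s):
        # Naive sentence split: break after '.', '!' or '?' plus whitespace.
        sentences = [x for x in re.split(r"(?<=[.!?])\s+", para.strip()) if x]
        segments = []
        for i in range(0, len(sentences), granularity):
            count += 1
            chunk = " ".join(sentences[i : i + granularity])
            segments.append(f"<#{count}#> {chunk}")
        if segments:
            numbered_paragraphs.append(" ".join(segments))
    # Rejoin paragraphs with a blank line to preserve paragraph structure.
    return "\n\n".join(numbered_paragraphs)


print(number_segments_sketch("Hello world! How are you? Have a good day."))
# <#1#> Hello world! <#2#> How are you? <#3#> Have a good day.
print(number_segments_sketch("Hello world! How are you? Have a good day.", 2))
# <#1#> Hello world! How are you? <#2#> Have a good day.
```

A larger granularity gives the LLM fewer, coarser segments to choose from, which shortens the numbered prompt at the cost of less precise extraction.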
parse_number_range_list(specs)
Parse a specs string like "3,5,7-10" into a list of integers.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
specs | str | A string containing segment numbers and/or ranges (e.g., "3,5,7-10"). | required |
Returns:
Type | Description |
---|---|
List[int] | List of segment numbers. |
Example
>>> parse_number_range_list("3,5,7-10")
[3, 5, 7, 8, 9, 10]
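The spec format is small enough to parse by hand. The following sketch assumes non-negative integers, inclusive ranges written as start-end, and silently skips empty items; the name parse_number_range_list_sketch is hypothetical and the real function may validate input differently.

```python
from typing import List


def parse_number_range_list_sketch(specs: str) -> List[int]:
    """Expand a spec such as "3,5,7-10" into a flat list of integers."""
    numbers: List[int] = []
    for part in specs.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas (an assumption of this sketch)
        if "-" in part:
            # A range is inclusive on both ends.
            start, end = (int(x) for x in part.split("-", 1))
            numbers.extend(range(start, end + 1))
        else:
            numbers.append(int(part))
    return numbers


print(parse_number_range_list_sketch("3,5,7-10"))
# [3, 5, 7, 8, 9, 10]
```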
strip_k(s, k=2)
Strip leading and trailing whitespace from the input text, keeping at most k whitespace characters on each side. This removes excess leading/trailing whitespace from a text while preserving paragraph structure.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
s | str | The input text. | required |
k | int | The number of leading and trailing whitespace characters to retain. | 2 |
Returns:
Name | Type | Description |
---|---|---|
str | str | The text with leading and trailing whitespace removed beyond length k. |
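A short sketch shows what "beyond length k" means in practice: up to k whitespace characters adjacent to the content are kept on each side and the rest are dropped. This is an illustration under that reading of the description, with the hypothetical name strip_k_sketch.

```python
def strip_k_sketch(s: str, k: int = 2) -> str:
    """Keep at most k whitespace characters on each side of the content."""
    # Count the whitespace characters at each end.
    lead = len(s) - len(s.lstrip())
    trail = len(s) - len(s.rstrip())
    # Drop only the excess beyond k, keeping the characters nearest the content.
    start = lead - min(lead, k)
    end = len(s) - (trail - min(trail, k))
    return s[start:end]


print(repr(strip_k_sketch("\n\n\nHello\n\n\n\n")))
# '\n\nHello\n\n'
```

With the default k=2, a paragraph break ("\n\n") at either end of a chunk survives, which is how paragraph structure is preserved when chunks are later rejoined.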
clean_whitespace(text)
Remove extra whitespace from the input text, while preserving paragraph structure.
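One plausible reading of "preserving paragraph structure" is that runs of whitespace inside a paragraph collapse to single spaces while blank-line paragraph breaks are kept. The sketch below follows that reading; it is an assumption, not the library code, and clean_whitespace_sketch is a hypothetical name.

```python
import re


def clean_whitespace_sketch(text: str) -> str:
    """Collapse whitespace inside paragraphs; keep one blank line between them."""
    paragraphs = re.split(r"\n[ \t]*\n", text)
    # str.split() with no arguments splits on any whitespace run,
    # so single newlines inside a paragraph also become spaces.
    cleaned = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)


messy = "Hello   world.\n  Same para.\n\n\n\nNext    para."
print(repr(clean_whitespace_sketch(messy)))
# 'Hello world. Same para.\n\nNext para.'
```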
extract_numbered_segments(s, specs)
Extract specified segments from a numbered text, preserving paragraph structure.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
s | str | The input text containing numbered segments. | required |
specs | str | A string containing segment numbers and/or ranges (e.g., "3,5,7-10"). | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | Extracted segments, keeping original paragraph structures. |
Example
>>> text = "<#1#> Hello world! <#2#> How are you? <#3#> Have a good day."
>>> extract_numbered_segments(text, "1,3")
'Hello world! Have a good day.'
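Extraction is the inverse of numbering: parse the spec into a set of wanted numbers, then keep only the matching segments within each paragraph. The sketch below assumes the <#1#>, <#2#> marker style that number_segments is documented to produce and blank-line paragraph breaks; the regular expression and the name extract_numbered_segments_sketch are illustrative choices, not the library's code.

```python
import re


def extract_numbered_segments_sketch(s: str, specs: str) -> str:
    """Return the text of the segments whose numbers appear in specs."""
    # Expand a spec such as "1,3-4" into a set of wanted segment numbers.
    wanted = set()
    for part in specs.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            wanted.update(range(start, end + 1))
        elif part:
            wanted.add(int(part))

    kept_paragraphs = []
    for para in re.split(r"\n[ \t]*\n", s):
        # Each match is (segment number, segment text up to the next marker).
        segments = re.findall(
            r"<#(\d+)#>\s*(.*?)\s*(?=<#\d+#>|$)", para, flags=re.DOTALL
        )
        kept = [text for num, text in segments if int(num) in wanted]
        if kept:
            kept_paragraphs.append(" ".join(kept))
    # Paragraphs that lose all their segments are dropped entirely.
    return "\n\n".join(kept_paragraphs)


numbered = "<#1#> Hello world! <#2#> How are you? <#3#> Have a good day."
print(extract_numbered_segments_sketch(numbered, "1,3"))
# Hello world! Have a good day.
```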
extract_content_from_path(path, parsing, doc_type=None)
Extract the content from a file path or URL, or a list of file paths or URLs.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | bytes \| str \| List[str] | The file path or URL, or a list of file paths or URLs, or bytes content. The bytes option is meant to support cases where upstream code may have already loaded the content (e.g., from a database or API) and we want to avoid having to copy the content to a temporary file. | required |
parsing | ParsingConfig | The parsing configuration. | required |
doc_type | str \| DocumentType \| None | The document type if known. If multiple paths are given, this MUST apply to ALL docs. | None |
Returns:
Type | Description |
---|---|
str \| List[str] | The extracted content if a single file path or URL is provided, or a list of extracted contents if a list of file paths or URLs is provided. |
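The return type depends on the shape of the input: a single path, URL, or bytes object yields one string, while a list yields a list of strings. The sketch below illustrates only that dispatch. It does not use ParsingConfig or doc_type, and extract_content_sketch, read_one, and fake_reader are hypothetical names invented for the example; the real function performs actual document parsing.

```python
from typing import Callable, List, Union

Source = Union[bytes, str]


def extract_content_sketch(
    path: Union[bytes, str, List[str]],
    read_one: Callable[[Source], str],
) -> Union[str, List[str]]:
    """Return one string for a single input, or a list for a list of inputs."""
    if isinstance(path, (bytes, str)):
        return read_one(path)
    # A list of paths/URLs maps to a list of contents, in the same order.
    return [read_one(p) for p in path]


def fake_reader(source: Source) -> str:
    # Hypothetical stand-in for real parsing: decode bytes, echo paths/URLs.
    if isinstance(source, bytes):
        return source.decode("utf-8")
    return f"<content of {source}>"


print(extract_content_sketch("a.txt", fake_reader))
# <content of a.txt>
print(extract_content_sketch(["a.txt", "b.txt"], fake_reader))
# ['<content of a.txt>', '<content of b.txt>']
print(extract_content_sketch(b"already loaded", fake_reader))
# already loaded
```

Callers that accept either form should check whether the result is a str or a list before iterating, since iterating over a str yields characters.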