utils
batched(iterable, n)

Batch data into tuples of length n. The last batch may be shorter.
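The batching behavior described above can be illustrated with a minimal sketch. This is not the langroid source; the name `batched_sketch` and the check on `n` are assumptions made for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def batched_sketch(iterable: Iterable[T], n: int) -> Iterator[Tuple[T, ...]]:
    """Hypothetical stand-in: yield tuples of up to n items each."""
    if n < 1:
        raise ValueError("n must be at least one")
    it = iter(iterable)
    while True:
        # Pull the next n items; the final batch may hold fewer.
        batch = tuple(islice(it, n))
        if not batch:
            return
        yield batch


print(list(batched_sketch("ABCDEFG", 3)))
# [('A', 'B', 'C'), ('D', 'E', 'F'), ('G',)]
```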
Source code in langroid/parsing/utils.py
              
closest_string(query, string_list)

Find the closest match to the query in a list of strings.
This function is case-insensitive and ignores leading and trailing whitespace. If no match is found, it returns 'No match found'.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| query | str | The string to match. | required | 
| string_list | List[str] | The list of strings to search. | required | 
Returns:
| Type | Description | 
|---|---|
| str | The closest match to the query from the list, or 'No match found' if no match is found. | 
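One way such a lookup could work is sketched below with the standard library's `difflib.get_close_matches`. The function name `closest_string_sketch`, the similarity cutoff of 0.6, and the choice to return the original (un-normalized) list entry are all assumptions, not details confirmed by this page.

```python
import difflib
from typing import List


def closest_string_sketch(query: str, string_list: List[str]) -> str:
    """Hypothetical stand-in: case-insensitive, whitespace-tolerant match."""
    # Map each normalized candidate back to its original spelling.
    normalized = {s.strip().lower(): s for s in string_list}
    matches = difflib.get_close_matches(
        query.strip().lower(), list(normalized), n=1, cutoff=0.6
    )
    return normalized[matches[0]] if matches else "No match found"


print(closest_string_sketch("  APPLE ", ["apple", "banana"]))  # apple
print(closest_string_sketch("zzz", ["apple", "banana"]))  # No match found
```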
              
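split_paragraphs(text)

Split the input text into paragraphs using a blank line ("\n\n") as the delimiter.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| text | str | The input text. | required | 
Returns:
| Type | Description | 
|---|---|
| list | A list of paragraphs. | 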
              
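split_newlines(text)

Split the input text into lines using a newline ("\n") as the delimiter.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| text | str | The input text. | required | 
Returns:
| Type | Description | 
|---|---|
| list | A list of lines. | 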
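The two splitting helpers can be approximated as follows. This is a sketch under stated assumptions: the function names are hypothetical, and stripping each piece and dropping empty pieces are choices made here, not behavior documented on this page.

```python
import re
from typing import List


def split_paragraphs_sketch(text: str) -> List[str]:
    """Hypothetical stand-in: split on blank lines."""
    # A blank line may contain spaces or tabs between the two newlines.
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]


def split_newlines_sketch(text: str) -> List[str]:
    """Hypothetical stand-in: split on single newlines."""
    return [line.strip() for line in text.split("\n") if line.strip()]


sample = "a\nb\n\nc"
print(split_paragraphs_sketch(sample))  # ['a\nb', 'c']
print(split_newlines_sketch(sample))  # ['a', 'b', 'c']
```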
              
number_segments(s, granularity=1)

Number the segments in a given text, preserving paragraph structure.
A segment is a sequence of consecutive "sentences" whose length is given by granularity. A "sentence" is either a normal sentence or, when there is too little punctuation to identify sentences reliably, a pseudo-sentence found by heuristics: split by newline or, failing that, split every 40 words. The goal is simply to number segments at a reasonable granularity so that the LLM can identify relevant segments in the RelevanceExtractorAgent.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| s | str | The input text. | required | 
| granularity | int | The number of sentences in a segment. If this is -1, then the entire text is treated as a single segment, and is numbered as <#1#>. | 1 | 
Returns:
| Type | Description | 
|---|---|
| str | The text with segments numbered in the style <#1#>, <#2#> etc. | 
Example
    >>> number_segments("Hello world! How are you? Have a good day.")
    '<#1#> Hello world! <#2#> How are you? <#3#> Have a good day.'
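A simplified sketch of the numbering scheme follows. It assumes a naive sentence splitter (break after `.`, `!` or `?` followed by whitespace) and omits the newline and 40-word fallbacks described above; the name `number_segments_sketch` and the treatment of every granularity below 1 as "whole text" are assumptions.

```python
import re
from typing import List


def number_segments_sketch(s: str, granularity: int = 1) -> str:
    """Hypothetical stand-in: prefix each segment with <#k#>."""
    if granularity < 1:
        # Assumption: any value below 1 (including -1) means one segment.
        return "<#1#> " + s
    numbered: List[str] = []
    count = 0
    for paragraph in s.split("\n\n"):
        # Naive sentence split; real text may need better heuristics.
        sentences = [x for x in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if x]
        pieces = []
        for i in range(0, len(sentences), granularity):
            count += 1
            pieces.append(f"<#{count}#> " + " ".join(sentences[i : i + granularity]))
        numbered.append(" ".join(pieces))
    # Rejoin with blank lines so paragraph structure is preserved.
    return "\n\n".join(numbered)


print(number_segments_sketch("Hello world! How are you? Have a good day.", 2))
# <#1#> Hello world! How are you? <#2#> Have a good day.
```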
              
parse_number_range_list(specs)

Parse a specs string like "3,5,7-10" into a list of integers.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| specs | str | A string containing segment numbers and/or ranges (e.g., "3,5,7-10"). | required | 
Returns:
| Type | Description | 
|---|---|
| List[int] | List of segment numbers. | 
Example
    >>> parse_number_range_list("3,5,7-10")
    [3, 5, 7, 8, 9, 10]
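The parsing rule in the example can be sketched as below. The name `parse_number_range_list_sketch` is hypothetical, and skipping empty entries (as in "3,,5") is an assumption rather than documented behavior.

```python
from typing import List


def parse_number_range_list_sketch(specs: str) -> List[int]:
    """Hypothetical stand-in: expand '3,5,7-10' into a list of ints."""
    numbers: List[int] = []
    for part in specs.split(","):
        part = part.strip()
        if not part:
            continue  # assumption: ignore empty entries
        if "-" in part:
            # A range such as "7-10" is inclusive at both ends.
            start, end = (int(x) for x in part.split("-", 1))
            numbers.extend(range(start, end + 1))
        else:
            numbers.append(int(part))
    return numbers


print(parse_number_range_list_sketch("3,5,7-10"))  # [3, 5, 7, 8, 9, 10]
```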
              
strip_k(s, k=2)

Strip leading and trailing whitespace from the input text, keeping at most k whitespace characters at each end. This is useful for trimming excess leading or trailing whitespace while preserving paragraph structure.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| s | str | The input text. | required | 
| k | int | The maximum number of leading and of trailing whitespace characters to retain. | 2 | 
Returns:
| Type | Description | 
|---|---|
| str | The text with leading and trailing whitespace beyond length k removed. | 
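The trimming rule can be sketched as follows. The name `strip_k_sketch` is hypothetical; the sketch assumes k is non-negative, keeps the whitespace characters nearest the content, and treats an all-whitespace input as a special case, none of which is confirmed by this page.

```python
def strip_k_sketch(s: str, k: int = 2) -> str:
    """Hypothetical stand-in: keep at most k whitespace chars per end."""
    core = s.strip()
    if not core:
        # Assumption: an all-whitespace input keeps at most k characters.
        return s[:k]
    lead = s[: len(s) - len(s.lstrip())]  # leading whitespace run
    trail = s[len(s.rstrip()):]  # trailing whitespace run
    # Keep the k characters adjacent to the content on each side.
    return lead[max(len(lead) - k, 0):] + core + trail[:k]


print(repr(strip_k_sketch("\n\n\n  hi \n\n\n", k=2)))  # '  hi \n'
```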
              
clean_whitespace(text)

Remove extra whitespace from the input text, while preserving paragraph structure.
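A minimal sketch of this kind of cleanup is shown below. It assumes paragraphs are separated by blank lines and that all whitespace runs inside a paragraph, including single newlines, collapse to one space; the name `clean_whitespace_sketch` and these rules are assumptions, not the library's documented behavior.

```python
import re


def clean_whitespace_sketch(text: str) -> str:
    """Hypothetical stand-in: collapse whitespace inside each paragraph."""
    paragraphs = re.split(r"\n\s*\n", text)
    # str.split() with no arguments drops all runs of whitespace.
    cleaned = [" ".join(p.split()) for p in paragraphs]
    # Rejoin non-empty paragraphs with exactly one blank line.
    return "\n\n".join(p for p in cleaned if p)


print(repr(clean_whitespace_sketch("  a   b \n c\n\n\n  d  ")))
# 'a b c\n\nd'
```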
              
extract_numbered_segments(s, specs)

Extract specified segments from a numbered text, preserving paragraph structure.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| s | str | The input text containing numbered segments. | required | 
| specs | str | A string containing segment numbers and/or ranges (e.g., "3,5,7-10"). | required | 
Returns:
| Type | Description | 
|---|---|
| str | Extracted segments, keeping original paragraph structures. | 
Example
    >>> text = "(1) Hello world! (2) How are you? (3) Have a good day."
    >>> extract_numbered_segments(text, "1,3")
    'Hello world! Have a good day.'
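The extraction step can be sketched as below, assuming the text carries the `<#n#>` markers that number_segments is documented to produce (rather than the parenthesized numbers in the example above). The name `extract_numbered_segments_sketch`, the regular expression, and the way kept segments are joined are all assumptions for illustration.

```python
import re
from typing import List, Set


def extract_numbered_segments_sketch(s: str, specs: str) -> str:
    """Hypothetical stand-in: keep only segments whose number is in specs."""
    wanted: Set[int] = set()
    for part in specs.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            wanted.update(range(start, end + 1))
        else:
            wanted.add(int(part))

    # A segment runs from its marker up to the next marker or end of text.
    pattern = re.compile(r"<#(\d+)#>([\s\S]*?)(?=<#\d+#>|$)")
    kept_paragraphs: List[str] = []
    for paragraph in s.split("\n\n"):
        kept = [
            m.group(2).strip()
            for m in pattern.finditer(paragraph)
            if int(m.group(1)) in wanted
        ]
        if kept:
            kept_paragraphs.append(" ".join(kept))
    return "\n\n".join(kept_paragraphs)


numbered = "<#1#> Hello world! <#2#> How are you?\n\n<#3#> Have a good day."
print(repr(extract_numbered_segments_sketch(numbered, "1,3")))
# 'Hello world!\n\nHave a good day.'
```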
              
extract_content_from_path(path, parsing, doc_type=None)

Extract the content from a file path or URL, or a list of file paths or URLs.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| path | bytes \| str \| List[str] | The file path or URL, or a list of file paths or URLs, or bytes content. The bytes option supports cases where upstream code has already loaded the content (e.g., from a database or API) and we want to avoid copying it to a temporary file. | required | 
| parsing | ParsingConfig | The parsing configuration. | required | 
| doc_type | str \| DocumentType \| None | The document type if known. If multiple paths are given, this MUST apply to ALL docs. | None | 
Returns:
| Type | Description | 
|---|---|
| str \| List[str] | The extracted content if a single file path or URL is provided, or a list of extracted contents if a list of file paths or URLs is provided. |