search
Utils to search for close matches in (a list of) strings. Useful for retrieval of docs/chunks relevant to a query, in the context of Retrieval-Augmented Generation (RAG), and SQLChat (e.g., to pull relevant parts of a large schema). See tests for examples: tests/main/test_string_search.py
find_fuzzy_matches_in_docs(query, docs, docs_clean, k, words_before=None, words_after=None)
¶
Find approximate matches of the query in the docs and return surrounding characters.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| query | str | The search string. | required |
| docs | List[Document] | List of Document objects to search through. | required |
| docs_clean | List[Document] | List of Document objects with cleaned content. | required |
| k | int | Number of best matches to return. | required |
| words_before | int \| None | Number of words to include before each match. Default None => return max | None |
| words_after | int \| None | Number of words to include after each match. Default None => return max | None |
Returns:

| Type | Description |
|---|---|
| List[Tuple[Document, float]] | List of (Document, score) tuples. |
Source code in langroid/parsing/search.py
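The core idea — score each candidate by approximate string similarity to the query and keep the top k — can be sketched with the stdlib's `difflib`. This is a hypothetical illustration (`fuzzy_matches_sketch`) operating on plain strings, not langroid's implementation, which works on `Document` objects and returns surrounding context:

```python
import difflib
from typing import List, Tuple

def fuzzy_matches_sketch(query: str, docs: List[str], k: int) -> List[Tuple[str, float]]:
    """Rank plain-string docs by fuzzy similarity to the query; keep the top k."""
    scored = [
        # ratio() is 2*M / (len(query) + len(doc)), where M counts matching chars
        (doc, difflib.SequenceMatcher(None, query.lower(), doc.lower()).ratio())
        for doc in docs
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

docs = ["The apple fell from the tree", "Bananas are yellow", "An apple a day"]
top = fuzzy_matches_sketch("apple", docs, k=2)
```

Note that `ratio()` penalizes long documents, which is one reason real retrieval combines fuzzy matching with length-aware scoring such as BM25 (below).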
preprocess_text(text)
¶
Preprocesses the given text by: 1. Lowercasing all words. 2. Tokenizing (splitting the text into words). 3. Removing punctuation. 4. Removing stopwords. 5. Lemmatizing words.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The input text. | required |
Returns:

| Name | Type | Description |
|---|---|---|
| str | str | The preprocessed text. |
Source code in langroid/parsing/search.py
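The five steps can be approximated with a stdlib-only stand-in. The real function presumably relies on an NLP library (e.g., NLTK) for its stopword list and lemmatizer; the tiny stopword set and crude de-pluralization below are illustrative assumptions, not the actual behavior:

```python
import string

# Illustrative stopword set; a real implementation would use a full corpus list.
STOPWORDS = {"the", "a", "an", "is", "are", "was", "were", "and", "or", "of"}

def preprocess_text_sketch(text: str) -> str:
    text = text.lower()                                                # 1. lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))   # 3. strip punctuation
    words = text.split()                                               # 2. tokenize
    words = [w for w in words if w not in STOPWORDS]                   # 4. drop stopwords
    # 5. crude plural-stripping as a stand-in for true lemmatization
    words = [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words]
    return " ".join(words)
```

Preprocessing both docs and query this way makes token-based scorers like BM25 far less sensitive to surface variation ("Cats" vs "cat").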
find_closest_matches_with_bm25(docs, docs_clean, query, k=5)
¶
Finds the k closest approximate matches using the BM25 algorithm.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| docs | List[Document] | List of Documents to search through. | required |
| docs_clean | List[Document] | List of cleaned Documents. | required |
| query | str | The search query. | required |
| k | int | Number of matches to retrieve. Defaults to 5. | 5 |
Returns:

| Type | Description |
|---|---|
| List[Tuple[Document, float]] | List of (Document, score) tuples. |
Source code in langroid/parsing/search.py
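For intuition, the BM25 scoring math itself fits in a few lines of plain Python. This is a self-contained sketch with hypothetical names (the library most likely delegates to an existing BM25 package rather than hand-rolling the formula):

```python
import math
from collections import Counter
from typing import List, Tuple

def bm25_rank(docs: List[str], query: str, k: int = 5,
              k1: float = 1.5, b: float = 0.75) -> List[Tuple[str, float]]:
    """Score whitespace-tokenized docs against a query with BM25; return top k."""
    toks = [d.lower().split() for d in docs]
    n = len(docs)
    avgdl = sum(len(t) for t in toks) / n          # average document length
    df = Counter()                                 # document frequency per term
    for t in toks:
        for term in set(t):
            df[term] += 1
    scores = []
    for doc, t in zip(docs, toks):
        tf = Counter(t)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # idf rewards rare terms; the tf factor saturates via k1 and is
            # normalized by document length via b
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(t) / avgdl))
        scores.append((doc, score))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]
```

In practice the docs would be the cleaned versions (see `preprocess_text`), so that stopwords and inflection don't distort term frequencies.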
get_context(query, text, words_before=100, words_after=100)
¶
Returns a portion of text containing the best approximate match of the query, including words_before words before and words_after words after the match.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| query | str | The string to search for. | required |
| text | str | The body of text in which to search. | required |
| words_before | int | The number of words before the match to return. | 100 |
| words_after | int | The number of words after the match to return. | 100 |

Returns:

| Type | Description |
|---|---|
| str | A string containing words_before words before, the match, and words_after words after the best approximate match position of the query in the text. The text is extracted from the original text, preserving formatting, whitespace, etc., so it does not disturb any downstream processing. If no match is found, returns an empty string. |
| int | The start position of the match in the text. |
| int | The end position of the match in the text. |

Example:
get_context("apple", "The quick brown fox jumps over the apple.", 3, 2)
'fox jumps over the apple.'
Source code in langroid/parsing/search.py
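The behavior — locate the best approximate match, then expand by a word budget on each side — can be approximated with `difflib.SequenceMatcher`. This is a rough sketch (hypothetical `get_context_sketch`): unlike the real function, the simple split/join here does not preserve the original whitespace between the flanking words:

```python
import difflib
from typing import Tuple

def get_context_sketch(query: str, text: str, words_before: int = 100,
                       words_after: int = 100) -> Tuple[str, int, int]:
    """Return (snippet, start, end) for the best approximate match of query."""
    m = difflib.SequenceMatcher(None, query, text).find_longest_match(
        0, len(query), 0, len(text))
    if m.size == 0:
        return "", -1, -1
    start, end = m.b, m.b + m.size        # match span within `text`
    # Expand by a word budget on each side (whitespace is not preserved here).
    before = text[:start].split()[-words_before:] if words_before else []
    after = text[end:].split()[:words_after] if words_after else []
    snippet = " ".join(before + [text[start:end]] + after)
    return snippet, start, end
```

Returning the start/end offsets alongside the snippet lets callers highlight or re-extract the match from the original text.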
eliminate_near_duplicates(passages, threshold=0.8)
¶
Eliminate near duplicate text passages from a given list using MinHash and LSH. TODO: this has not been tested and the datasketch lib is not a dependency.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| passages | List[str] | A list of text passages. | required |
| threshold | float | Jaccard similarity threshold to consider two passages as near-duplicates. Default is 0.8. | 0.8 |
Returns:

| Type | Description |
|---|---|
| List[str] | A list of passages after eliminating near duplicates. |
Example:
passages = ["Hello world", "Hello, world!", "Hi there", "Hello world!"]
print(eliminate_near_duplicates(passages))
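The near-duplicate test can be illustrated with exact Jaccard similarity on word sets. The actual function approximates this with MinHash signatures and LSH buckets (via the datasketch library) to avoid pairwise comparisons; the sketch below is quadratic in the number of passages and only suitable for small lists:

```python
import re
from typing import List, Set

def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets: |intersection| / |union|."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

def near_dedup_sketch(passages: List[str], threshold: float = 0.8) -> List[str]:
    """Keep each passage only if it is not near-duplicate of one already kept."""
    kept: List[str] = []
    kept_sets: List[Set[str]] = []
    for p in passages:
        words = set(re.findall(r"\w+", p.lower()))   # crude word-set shingle
        if all(jaccard(words, s) < threshold for s in kept_sets):
            kept.append(p)
            kept_sets.append(words)
    return kept
```

With the example above, "Hello, world!" and "Hello world!" reduce to the same word set as "Hello world" (Jaccard 1.0), so only the first variant and "Hi there" survive.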