search
Utilities to search for close matches in (a list of) strings. Useful for retrieving docs/chunks relevant to a query, in the context of Retrieval-Augmented Generation (RAG) and SQLChat (e.g., to pull out the relevant parts of a large schema). See tests for examples: tests/main/test_string_search.py
find_fuzzy_matches_in_docs(query, docs, docs_clean, k, words_before=None, words_after=None)
Find approximate matches of the query in the docs and return surrounding characters.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
query | str | The search string. | required |
docs | List[Document] | List of Document objects to search through. | required |
docs_clean | List[Document] | List of Document objects with cleaned content. | required |
k | int | Number of best matches to return. | required |
words_before | int \| None | Number of words to include before each match. Default None => return max. | None |
words_after | int \| None | Number of words to include after each match. Default None => return max. | None |

Returns:

Type | Description |
---|---|
List[Tuple[Document, float]] | List of (Document, score) tuples. |
Source code in langroid/parsing/search.py
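A minimal usage sketch (not taken from the library docs): it assumes Document and DocMetaData can be imported from langroid.mytypes, and that docs_clean holds the same documents with preprocess_text applied to their content.

```python
# Illustrative sketch. Assumes Document/DocMetaData live in langroid.mytypes
# and that docs_clean mirrors docs with cleaned content.
from langroid.mytypes import Document, DocMetaData
from langroid.parsing.search import find_fuzzy_matches_in_docs, preprocess_text

docs = [
    Document(content="Giraffes are the tallest land animals.", metadata=DocMetaData()),
    Document(content="The Eiffel Tower is in Paris.", metadata=DocMetaData()),
]
docs_clean = [
    Document(content=preprocess_text(d.content), metadata=d.metadata) for d in docs
]

# Best match plus a few words of surrounding context on each side.
for doc, score in find_fuzzy_matches_in_docs(
    query="tallest animal",
    docs=docs,
    docs_clean=docs_clean,
    k=1,
    words_before=5,
    words_after=5,
):
    print(score, doc.content)
```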
preprocess_text(text)
Preprocesses the given text by:

1. Lowercasing all words.
2. Tokenizing (splitting the text into words).
3. Removing punctuation.
4. Removing stopwords.
5. Lemmatizing words.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
text | str | The input text. | required |

Returns:

Name | Type | Description |
---|---|---|
str | str | The preprocessed text. |
Source code in langroid/parsing/search.py
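For illustration only: the exact output depends on the tokenizer, stopword list, and lemmatizer used internally (NLTK is a reasonable guess, but check the source), so the result shown in the comment is a plausible example rather than a guaranteed value.

```python
from langroid.parsing.search import preprocess_text

cleaned = preprocess_text("The quick brown foxes were jumping over the lazy dogs!")
print(cleaned)
# Something along the lines of: "quick brown fox jumping lazy dog"
# (exact tokens depend on the stopword list and lemmatizer in use)
```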
find_closest_matches_with_bm25(docs, docs_clean, query, k=5)
Finds the k closest approximate matches using the BM25 algorithm.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
docs | List[Document] | List of Documents to search through. | required |
docs_clean | List[Document] | List of cleaned Documents. | required |
query | str | The search query. | required |
k | int | Number of matches to retrieve. Defaults to 5. | 5 |

Returns:

Type | Description |
---|---|
List[Tuple[Document, float]] | List of (Document, score) tuples. |
Source code in langroid/parsing/search.py
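A usage sketch along the same lines as the fuzzy-match example above, with the same assumptions about where Document and DocMetaData come from.

```python
from langroid.mytypes import Document, DocMetaData
from langroid.parsing.search import find_closest_matches_with_bm25, preprocess_text

docs = [
    Document(content="BM25 ranks documents by term frequency and rarity.", metadata=DocMetaData()),
    Document(content="Paris is the capital of France.", metadata=DocMetaData()),
]
docs_clean = [
    Document(content=preprocess_text(d.content), metadata=d.metadata) for d in docs
]

# Higher scores indicate closer BM25 matches.
for doc, score in find_closest_matches_with_bm25(docs, docs_clean, query="rank documents", k=2):
    print(f"{score:.2f}  {doc.content}")
```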
get_context(query, text, words_before=100, words_after=100)
Returns a portion of text containing the best approximate match of the query, including words_before words before and words_after words after the match.

Args:
    query (str): The string to search for.
    text (str): The body of text in which to search.
    words_before (int): The number of words to return before the match.
    words_after (int): The number of words to return after the match.

Returns:
    str: A string containing words_before words before, the match, and words_after words after the best approximate match position of the query in the text. If no match is found, returns an empty string.
    int: The start position of the match in the text.
    int: The end position of the match in the text.
Example:

    get_context("apple", "The quick brown fox jumps over the apple.", 3, 2)
    'fox jumps over the apple.'
Source code in langroid/parsing/search.py
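Since the return value combines the context string with the match positions (per the Returns description above), a caller would typically unpack all three. This is a sketch, not code from the library docs.

```python
from langroid.parsing.search import get_context

context, start, end = get_context(
    "apple",
    "The quick brown fox jumps over the apple.",
    words_before=3,
    words_after=2,
)
print(context)      # e.g. 'fox jumps over the apple.'
print(start, end)   # positions of the match in the text (see Returns above)
```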
eliminate_near_duplicates(passages, threshold=0.8)
Eliminate near-duplicate text passages from a given list using MinHash and LSH.

TODO: this has not been tested, and the datasketch lib is not a dependency.

Args:
    passages (List[str]): A list of text passages.
    threshold (float, optional): Jaccard similarity threshold to consider two passages as near-duplicates. Default is 0.8.
Returns:

Type | Description |
---|---|
List[str] | A list of passages after eliminating near duplicates. |
Example:

    passages = ["Hello world", "Hello, world!", "Hi there", "Hello world!"]
    print(eliminate_near_duplicates(passages))
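A note on running the example above: per the TODO, datasketch is not installed with Langroid and would have to be added separately (e.g. pip install datasketch). The exact passages that survive deduplication depend on the MinHash parameters, so the comment below is only a plausible outcome.

```python
# Requires the datasketch package, which is NOT a Langroid dependency.
from langroid.parsing.search import eliminate_near_duplicates

passages = ["Hello world", "Hello, world!", "Hi there", "Hello world!"]
# With the default threshold=0.8, the near-identical "Hello world" variants
# would likely collapse to one passage, with "Hi there" kept as-is.
print(eliminate_near_duplicates(passages, threshold=0.8))
```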