Enriching Chunked Documents for Better Retrieval
Available in Langroid v0.34.0 or later.
When using the `DocChatAgent`
for RAG with documents in highly specialized or technical
domains, retrieval accuracy may be low, since embeddings alone may not capture
relationships between entities. For example, suppose a document chunk consists of the medical
test name "BUN" (Blood Urea Nitrogen), and a retrieval query asks for
tests related to kidney function: the embedding for "BUN" may not be close to the
embedding for "kidney function", so the chunk may not be retrieved.
In such cases it is useful to enrich the chunked documents with additional keywords (or even "hypothetical questions") to increase the "semantic surface area" of the chunk, so that the chunk is more likely to be retrieved for relevant queries.
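For instance, an enriched version of the "BUN" chunk might carry an added keyword and a hypothetical question alongside the original text (the enrichment text below is purely illustrative, not output produced by the library):

```python
# Illustrative only: a chunk's text before and after enrichment.
original_chunk = "BUN"
enriched_chunk = (
    "BUN\n"
    "kidney function\n"                          # added keyword
    "What test measures blood urea nitrogen?"    # added hypothetical question
)
# A query embedding for "tests related to kidney function" is far more likely
# to land near `enriched_chunk` than near the bare string "BUN".
```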
As of Langroid v0.34.0, you can provide a `chunk_enrichment_config`
of type `ChunkEnrichmentAgentConfig`
in the `DocChatAgentConfig`.
This config extends `ChatAgentConfig`
and has the following fields:

- `batch_size` (int): the batch size for the chunk enrichment agent. Default is 50.
- `delimiter` (str): the delimiter to use when concatenating the chunk and the enriched text.
- `enrichment_prompt_fn` (function `str -> str`): creates a prompt from a doc-chunk string `x`.
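For instance, a minimal config might override only these fields; the sketch below is illustrative (the delimiter string and prompt are arbitrary choices, not library defaults):

```python
from langroid.agent.special.doc_chat_agent import ChunkEnrichmentAgentConfig

# Minimal sketch: only the fields listed above are set; everything else is
# inherited from ChatAgentConfig defaults.
minimal_config = ChunkEnrichmentAgentConfig(
    batch_size=20,
    delimiter="<#ENRICH#>",  # arbitrary illustrative delimiter, not the default
    enrichment_prompt_fn=lambda x: f"List 3 keywords that describe: {x}",
)
```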
In the above medical test example, suppose we want to augment a chunk containing
only the medical test name with the organ system it is related to. We can set up
a `ChunkEnrichmentAgentConfig` as follows:
```python
from langroid.agent.special.doc_chat_agent import (
    ChunkEnrichmentAgentConfig,
    DocChatAgentConfig,
)

enrichment_config = ChunkEnrichmentAgentConfig(
    batch_size=10,
    system_message="""
    You are an experienced clinical physician, very well-versed in
    medical tests and their names.
    You will be asked to identify WHICH ORGAN(s) Function/Health
    a test name is most closely associated with, to aid in
    retrieving the medical test names more accurately from an embeddings db
    that contains thousands of such test names.
    The idea is to use the ORGAN NAME(S) provided by you,
    to make the right test names easier to discover via keyword-matching
    or semantic (embedding) similarity.
    Your job is to generate up to 3 ORGAN NAMES
    MOST CLOSELY associated with the test name shown, ONE PER LINE.
    DO NOT SAY ANYTHING ELSE, and DO NOT BE OBLIGATED to provide 3 organs --
    if there is just one or two that are most relevant, that is fine.
    Examples:
    "cholesterol" -> "heart function",
    "LDL" -> "artery health",
    "PSA" -> "prostate health",
    "TSH" -> "thyroid function", etc.
    """,
    enrichment_prompt_fn=lambda test: f"""
    Which ORGAN(S) Function/Health is the medical test named
    '{test}' most closely associated with?
    """,
)

doc_agent_config = DocChatAgentConfig(
    chunk_enrichment_config=enrichment_config,
    # ... other DocChatAgentConfig settings as needed
)
```
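A minimal usage sketch wiring this config into a `DocChatAgent` is shown below; the document path is a placeholder, other `DocChatAgentConfig` settings are left at defaults, and the ingestion/query calls assume the usual `ingest_doc_paths` and `llm_response` methods:

```python
from langroid.agent.special.doc_chat_agent import DocChatAgent, DocChatAgentConfig

# Sketch only: LLM, vector-db, and parsing settings are omitted / left at defaults.
config = DocChatAgentConfig(
    chunk_enrichment_config=enrichment_config,  # from the snippet above
)
agent = DocChatAgent(config)

# Enrichment happens here: chunks are augmented (in batches) before being
# embedded and stored in the vector-db.
agent.ingest_doc_paths(["data/lab-test-names.md"])  # hypothetical path

# At query time the enrichments help retrieval, but are stripped before the
# final LLM answer is generated.
response = agent.llm_response("Which tests are related to kidney function?")
print(response.content)
```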
This works as follows:

- Before ingesting document-chunks into the vector-db, a specialized
  "chunk enrichment" agent is created, configured with the `enrichment_config` above.
- For each document-chunk `x`, the agent's `llm_response_forget_async`
  method is called with the prompt created by `enrichment_prompt_fn(x)`.
  The resulting response text `y` is concatenated with the original chunk text `x`
  using the `delimiter`, before storing in the vector-db. This is done in batches
  of size `batch_size`.
- At query time, after chunk retrieval and before generating the final LLM response,
  the enrichments are stripped from the retrieved chunks, and the original content
  of the retrieved chunks is passed to the LLM for generating the final response
  (see the sketch below).
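The enrich-then-strip round trip can be pictured with this plain-Python sketch; it is illustrative only, not Langroid's internal code, and the delimiter value is arbitrary:

```python
# Illustrative sketch of the concatenation and stripping steps described above.
DELIMITER = "<#ENRICH#>"  # arbitrary example value, not the library default

def enrich(chunk: str, enrichment: str, delimiter: str = DELIMITER) -> str:
    """Combine the original chunk text with the agent-generated enrichment."""
    return f"{chunk}{delimiter}{enrichment}"

def strip_enrichment(stored: str, delimiter: str = DELIMITER) -> str:
    """Recover the original chunk text before it is passed to the final LLM."""
    return stored.split(delimiter, 1)[0]

stored_chunk = enrich("BUN", "kidney function")   # what goes into the vector-db
assert strip_enrichment(stored_chunk) == "BUN"    # what the final LLM sees
```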
See the script `examples/docqa/doc-chunk-enrich.py`
for a complete example. Also see the tests related to "enrichment" in
`test_doc_chat_agent.py`.