models

`FastEmbedEmbeddingsConfig` ¶

Bases: EmbeddingModelsConfig

Config for qdrant/fastembed embeddings; see https://github.com/qdrant/fastembed

`EmbeddingFunctionCallable(embed_model, batch_size=512)` ¶

A callable class designed to generate embeddings for a list of texts using the OpenAI or Azure OpenAI API, with automatic retries on failure.

Attributes:

Name	Type	Description
`embed_model`	`EmbeddingModel`	An instance of EmbeddingModel that provides configuration and utilities for generating embeddings.

Methods:

Name	Description
`__call__`	`(List[str]) -> Embeddings`: Generate embeddings for a list of input texts.

Parameters:

Name	Type	Description	Default
`embed_model`	`EmbeddingModel`	An instance of EmbeddingModel (for example OpenAIEmbeddings or AzureOpenAIEmbeddings) to use for generating embeddings.	required
`batch_size`	`int`	Number of texts per batch.	`512`

Source code in langroid/embedding_models/models.py

def __init__(self, embed_model: EmbeddingModel, batch_size: int = 512):
    """
    Initialize the EmbeddingFunctionCallable with a specific model.

    Args:
        embed_model (EmbeddingModel): An instance of EmbeddingModel
                        (for example OpenAIEmbeddings or AzureOpenAIEmbeddings)
                        to use for generating embeddings.
        batch_size (int): Number of texts per batch.
    """
    self.embed_model = embed_model
    self.batch_size = batch_size
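
The role of `batch_size` can be illustrated with a minimal sketch. This is not the langroid implementation: `BatchedEmbedder` and `fake_embed` are hypothetical stand-ins, the "embedding" is a toy two-number vector, and the retry behavior described above is left out.

```python
from typing import Callable, List

Embeddings = List[List[float]]


def fake_embed(batch: List[str]) -> Embeddings:
    """Hypothetical stand-in for an API call: one toy 2-d vector per text."""
    return [[float(len(text)), float(text.count(" "))] for text in batch]


class BatchedEmbedder:
    """Hypothetical callable that embeds texts in fixed-size batches."""

    def __init__(
        self,
        embed_batch: Callable[[List[str]], Embeddings],
        batch_size: int = 512,
    ) -> None:
        self.embed_batch = embed_batch
        self.batch_size = batch_size
        self.calls = 0  # number of backend calls made, for illustration

    def __call__(self, texts: List[str]) -> Embeddings:
        out: Embeddings = []
        # Walk the input in slices of at most batch_size texts.
        for start in range(0, len(texts), self.batch_size):
            self.calls += 1
            out.extend(self.embed_batch(texts[start : start + self.batch_size]))
        return out


embedder = BatchedEmbedder(fake_embed, batch_size=2)
vectors = embedder(["a", "b c", "d e f"])
```

With three texts and `batch_size=2`, the stand-in backend is called twice, and the results come back in input order.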

`OpenAIEmbeddings(config=OpenAIEmbeddingsConfig())` ¶

Bases: EmbeddingModel

Source code in langroid/embedding_models/models.py

def __init__(self, config: OpenAIEmbeddingsConfig = OpenAIEmbeddingsConfig()):
    super().__init__()
    self.config = config
    load_dotenv()

    # Check if using LangDB
    self.is_langdb = self.config.model_name.startswith("langdb/")

    if self.is_langdb:
        self.config.model_name = self.config.model_name.replace("langdb/", "")
        self.config.api_base = self.config.langdb_params.base_url
        project_id = self.config.langdb_params.project_id
        if project_id:
            self.config.api_base += "/" + project_id + "/v1"
        self.config.api_key = self.config.langdb_params.api_key

    if not self.config.api_key:
        self.config.api_key = os.getenv("OPENAI_API_KEY", "")

    self.config.organization = os.getenv("OPENAI_ORGANIZATION", "")

    if self.config.api_key == "":
        if self.is_langdb:
            raise ValueError(
                """
                LANGDB_API_KEY must be set in .env or your environment 
                to use OpenAIEmbeddings via LangDB.
                """
            )
        else:
            raise ValueError(
                """
                OPENAI_API_KEY must be set in .env or your environment 
                to use OpenAIEmbeddings.
                """
            )

    self.client = OpenAI(
        base_url=self.config.api_base,
        api_key=self.config.api_key,
        organization=self.config.organization,
    )
    model_for_tokenizer = self.config.model_name
    if model_for_tokenizer.startswith("openai/"):
        self.config.model_name = model_for_tokenizer.replace("openai/", "")
    self.tokenizer = tiktoken.encoding_for_model(self.config.model_name)
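
The prefix handling in the constructor above can be read as a small pure function. The sketch below is an illustration only: `resolve_endpoint` is a hypothetical helper, the base URL and project id are made-up example values, and API-key resolution is omitted.

```python
from typing import Optional, Tuple


def resolve_endpoint(
    model_name: str,
    default_base: Optional[str],
    langdb_base: str,
    project_id: str = "",
) -> Tuple[str, Optional[str]]:
    """Return (bare model name, base URL) following the prefix rules above."""
    if model_name.startswith("langdb/"):
        # Route through LangDB: strip the prefix and build the project URL.
        model_name = model_name[len("langdb/"):]
        base: Optional[str] = langdb_base
        if project_id:
            base = langdb_base + "/" + project_id + "/v1"
    else:
        base = default_base
    if model_name.startswith("openai/"):
        # The tokenizer lookup expects the bare model name.
        model_name = model_name[len("openai/"):]
    return model_name, base


via_langdb = resolve_endpoint(
    "langdb/openai/text-embedding-3-small",
    default_base=None,
    langdb_base="https://langdb.example",  # made-up URL
    project_id="proj1",  # made-up project id
)
direct = resolve_endpoint(
    "text-embedding-3-small", default_base=None, langdb_base="https://langdb.example"
)
```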

`truncate_texts(texts)` ¶

Truncate texts to the embedding model's context length. Returns lists of token IDs, or decoded strings when the model is accessed via LangDB. TODO: Maybe we should show a warning, and consider doing T5 summarization?

Source code in langroid/embedding_models/models.py

def truncate_texts(self, texts: List[str]) -> List[str] | List[List[int]]:
    """
    Truncate texts to the embedding model's context length.
    TODO: Maybe we should show warning, and consider doing T5 summarization?
    """
    truncated_tokens = [
        self.tokenizer.encode(text, disallowed_special=())[
            : self.config.context_length
        ]
        for text in texts
    ]

    if self.is_langdb:
        # LangDB embedding endpoint only works with strings, not tokens
        return [self.tokenizer.decode(tokens) for tokens in truncated_tokens]
    return truncated_tokens
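
The truncate-then-optionally-decode pattern above can be tried without tiktoken. In this sketch, `WhitespaceTokenizer` is a hypothetical stand-in that treats each word as one token, and `truncate` is an illustrative helper, not a langroid function.

```python
from typing import List, Union


class WhitespaceTokenizer:
    """Hypothetical stand-in for a tiktoken encoding: one token id per word."""

    def __init__(self) -> None:
        self.vocab: List[str] = []

    def encode(self, text: str) -> List[int]:
        ids = []
        for word in text.split():
            if word not in self.vocab:
                self.vocab.append(word)
            ids.append(self.vocab.index(word))
        return ids

    def decode(self, ids: List[int]) -> str:
        return " ".join(self.vocab[i] for i in ids)


def truncate(
    texts: List[str],
    tokenizer: WhitespaceTokenizer,
    context_length: int,
    as_strings: bool,
) -> Union[List[str], List[List[int]]]:
    # Keep at most context_length tokens per text.
    token_lists = [tokenizer.encode(t)[:context_length] for t in texts]
    if as_strings:
        # Some endpoints accept only strings, so decode the kept tokens.
        return [tokenizer.decode(ids) for ids in token_lists]
    return token_lists


tok = WhitespaceTokenizer()
as_tokens = truncate(["one two three four", "five"], tok, 3, as_strings=False)
as_text = truncate(["one two three four", "five"], tok, 3, as_strings=True)
```

Texts shorter than the limit pass through unchanged; longer ones silently lose their tail, which is the behavior the TODO above proposes to warn about.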

`AzureOpenAIEmbeddings(config=AzureOpenAIEmbeddingsConfig())` ¶

Bases: EmbeddingModel

Azure OpenAI embeddings model implementation.

Parameters:

Name	Type	Description	Default
`config`	`AzureOpenAIEmbeddingsConfig`	Configuration for Azure OpenAI embeddings model.	`AzureOpenAIEmbeddingsConfig()`

Raises: ValueError: If required Azure config values are not set.

Source code in langroid/embedding_models/models.py

def __init__(
    self, config: AzureOpenAIEmbeddingsConfig = AzureOpenAIEmbeddingsConfig()
):
    """
    Initializes Azure OpenAI embeddings model.

    Args:
        config: Configuration for Azure OpenAI embeddings model.
    Raises:
        ValueError: If required Azure config values are not set.
    """
    super().__init__()
    self.config = config
    load_dotenv()

    if self.config.api_key == "":
        raise ValueError(
            """AZURE_OPENAI_API_KEY env variable must be set to use 
        AzureOpenAIEmbeddings. Please set the AZURE_OPENAI_API_KEY value 
        in your .env file."""
        )

    if self.config.api_base == "":
        raise ValueError(
            """AZURE_OPENAI_API_BASE env variable must be set to use 
        AzureOpenAIEmbeddings. Please set the AZURE_OPENAI_API_BASE value 
        in your .env file."""
        )
    self.client = AzureOpenAI(
        api_key=self.config.api_key,
        api_version=self.config.api_version,
        azure_endpoint=self.config.api_base,
        azure_deployment=self.config.deployment_name,
    )
    self.tokenizer = tiktoken.encoding_for_model(self.config.model_name)

`truncate_texts(texts)` ¶

Truncate texts to the embedding model's context length. Returns lists of token IDs. TODO: Maybe we should show a warning, and consider doing T5 summarization?

Source code in langroid/embedding_models/models.py

def truncate_texts(self, texts: List[str]) -> List[str] | List[List[int]]:
    """
    Truncate texts to the embedding model's context length.
    TODO: Maybe we should show warning, and consider doing T5 summarization?
    """
    return [
        self.tokenizer.encode(text, disallowed_special=())[
            : self.config.context_length
        ]
        for text in texts
    ]

`embedding_fn()` ¶

Get the embedding function for Azure OpenAI.

Returns:

Type	Description
`Callable[[List[str]], Embeddings]`	Callable that generates embeddings for input texts.

Source code in langroid/embedding_models/models.py

def embedding_fn(self) -> Callable[[List[str]], Embeddings]:
    """Get the embedding function for Azure OpenAI.

    Returns:
        Callable that generates embeddings for input texts.
    """
    return EmbeddingFunctionCallable(self, self.config.batch_size)

`GeminiEmbeddings(config=GeminiEmbeddingsConfig())` ¶

Bases: EmbeddingModel

Source code in langroid/embedding_models/models.py

def __init__(self, config: GeminiEmbeddingsConfig = GeminiEmbeddingsConfig()):
    try:
        from google import genai
    except ImportError as e:
        raise LangroidImportError(extra="google-genai", error=str(e))
    super().__init__()
    self.config = config
    load_dotenv()
    self.config.api_key = os.getenv("GEMINI_API_KEY", "")

    if self.config.api_key == "":
        raise ValueError(
            """
            GEMINI_API_KEY env variable must be set to use GeminiEmbeddings.
            """
        )
    self.client = genai.Client(api_key=self.config.api_key)

`generate_embeddings(texts)` ¶

Generates embeddings for a list of input texts.

Source code in langroid/embedding_models/models.py

def generate_embeddings(self, texts: List[str]) -> List[List[float]]:
    """Generates embeddings for a list of input texts."""
    all_embeddings: List[List[float]] = []

    for batch in batched(texts, self.config.batch_size):
        result = self.client.models.embed_content(  # type: ignore[attr-defined]
            model=self.config.model_name,
            contents=batch,  # type: ignore
        )

        if not hasattr(result, "embeddings") or not isinstance(
            result.embeddings, list
        ):
            raise ValueError(
                "Unexpected format for embeddings: missing or incorrect type"
            )

        # Extract .values from ContentEmbedding objects
        all_embeddings.extend(
            [emb.values for emb in result.embeddings]  # type: ignore
        )

    return all_embeddings
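
The batch loop and response check above can be exercised offline. In the sketch below, `FakeEmbedding`, `FakeResult`, `fake_embed_content`, and `collect` are hypothetical stand-ins for the client and its response objects, and `batched` is a simple local helper assumed to behave like the one used in the source.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List


@dataclass
class FakeEmbedding:
    values: List[float]


@dataclass
class FakeResult:
    embeddings: List[FakeEmbedding]


def batched(items: List[str], n: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most n items."""
    for i in range(0, len(items), n):
        yield items[i : i + n]


def fake_embed_content(contents: List[str]) -> FakeResult:
    """Hypothetical backend: a 1-d vector holding each text's length."""
    return FakeResult([FakeEmbedding([float(len(c))]) for c in contents])


def collect(
    texts: List[str],
    batch_size: int,
    embed: Callable[[List[str]], object] = fake_embed_content,
) -> List[List[float]]:
    all_embeddings: List[List[float]] = []
    for batch in batched(texts, batch_size):
        result = embed(batch)
        embeddings = getattr(result, "embeddings", None)
        if not isinstance(embeddings, list):
            raise ValueError("Unexpected format for embeddings")
        # Unwrap the raw float vectors from the response objects.
        all_embeddings.extend(e.values for e in embeddings)
    return all_embeddings


vectors = collect(["ab", "c", "def"], batch_size=2)

# A response without an `embeddings` list is rejected.
try:
    collect(["x"], 1, embed=lambda batch: object())
    rejected = False
except ValueError:
    rejected = True
```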

`embedding_model(embedding_fn_type='openai')` ¶

Parameters:

Name	Type	Description	Default
`embedding_fn_type`	`str`	Type of embedding model to use. Options are: - "openai", - "azure-openai", - "fastembed", - "llamacppserver", - "gemini", or - "sentencetransformer" (also used for any unrecognized value). (others may be added in the future)	`'openai'`

Returns: EmbeddingModel: The corresponding embedding model class.

Source code in langroid/embedding_models/models.py

def embedding_model(embedding_fn_type: str = "openai") -> EmbeddingModel:
    """
    Args:
        embedding_fn_type: Type of embedding model to use. Options are:
         - "openai",
         - "azure-openai",
         - "fastembed",
         - "llamacppserver",
         - "gemini", or
         - "sentencetransformer" (also used for any unrecognized value).
            (others may be added in the future)
    Returns:
        EmbeddingModel: The corresponding embedding model class.
    """
    if embedding_fn_type == "openai":
        return OpenAIEmbeddings  # type: ignore
    elif embedding_fn_type == "azure-openai":
        return AzureOpenAIEmbeddings  # type: ignore
    elif embedding_fn_type == "fastembed":
        return FastEmbedEmbeddings  # type: ignore
    elif embedding_fn_type == "llamacppserver":
        return LlamaCppServerEmbeddings  # type: ignore
    elif embedding_fn_type == "gemini":
        return GeminiEmbeddings  # type: ignore
    else:  # default sentence transformer
        return SentenceTransformerEmbeddings  # type: ignore
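
Note that the function above returns a class, not an instance, and that any unrecognized string falls through to the sentence-transformer class. The sketch below shows that calling pattern with hypothetical stub classes; it swaps the `if`/`elif` chain for a dictionary lookup, which is an illustrative alternative rather than the source's approach.

```python
from typing import Dict, Type


class EmbeddingModelStub:
    """Hypothetical base class standing in for EmbeddingModel."""


class OpenAIStub(EmbeddingModelStub):
    pass


class GeminiStub(EmbeddingModelStub):
    pass


class SentenceTransformerStub(EmbeddingModelStub):
    pass


# Map type strings to classes; anything missing uses the default below.
_REGISTRY: Dict[str, Type[EmbeddingModelStub]] = {
    "openai": OpenAIStub,
    "gemini": GeminiStub,
}


def pick_model_class(kind: str = "openai") -> Type[EmbeddingModelStub]:
    return _REGISTRY.get(kind, SentenceTransformerStub)


model_cls = pick_model_class("gemini")  # a class
model = model_cls()  # instantiate it separately
fallback_cls = pick_model_class("no-such-backend")
```

Because unknown names silently select the default, a typo in the type string yields a sentence-transformer model rather than an error.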
