
Setting up a local LLM to work with Langroid

Example scripts in the examples/ directory

There are numerous example scripts that can be run with local LLMs in the examples/ directory of the main langroid repo. These examples also appear in the langroid-examples repo, which may additionally contain some examples not present in the langroid repo. Most of these scripts let you specify an LLM in the format -m <model>, where the specification of <model> is described in the guide below for local/open LLMs, or in the Non-OpenAI LLM guide. Scripts with the string local in their name have been specifically designed to work with certain local LLMs, as described in the respective scripts. For a pointer to a specific script illustrating a 2-agent chat, have a look at chat-search-assistant.py. This script, originally designed for GPT-4/GPT-4o, also works well with llama3-70b (tested via Groq, mentioned below).

Easiest: with Ollama

As of version 0.1.24, Ollama provides an OpenAI-compatible API server for the LLMs it supports, which massively simplifies running these LLMs with Langroid. Example below.

ollama pull mistral:7b-instruct-v0.2-q8_0

This provides an OpenAI-compatible server for the mistral:7b-instruct-v0.2-q8_0 model.

You can run any Langroid script using this model, by setting the chat_model in the OpenAIGPTConfig to ollama/mistral:7b-instruct-v0.2-q8_0, e.g.

import langroid.language_models as lm
import langroid as lr

llm_config = lm.OpenAIGPTConfig(
    chat_model="ollama/mistral:7b-instruct-v0.2-q8_0",
    chat_context_length=16_000, # adjust based on model
)
agent_config = lr.ChatAgentConfig(
    llm=llm_config,
    system_message="You are helpful but concise",
)
agent = lr.ChatAgent(agent_config)
# directly invoke agent's llm_response method
# response = agent.llm_response("What is the capital of Russia?")
task = lr.Task(agent, interactive=True)
task.run() # for an interactive chat loop

Setup Ollama with a GGUF model from HuggingFace

Some models are not directly supported by Ollama out of the box. To serve a GGUF model with Ollama, you can download the model from HuggingFace and set up a custom Modelfile for it.

E.g. download the GGUF version of dolphin-mixtral from here (specifically, download this file: dolphin-2.7-mixtral-8x7b.Q4_K_M.gguf).

To set up a custom ollama model based on this:

  • Save this model at a convenient place, e.g. ~/.ollama/models/
  • Create a modelfile for this model. First see what an existing modelfile for a similar model looks like, e.g. by running:

ollama show --modelfile dolphin-mixtral:latest

You will notice this file has a FROM line followed by a prompt template and other settings. Create a new file with these contents, changing only the FROM ... line to point to the model you downloaded, e.g.

FROM /Users/blah/.ollama/models/dolphin-2.7-mixtral-8x7b.Q4_K_M.gguf

  • Save this modelfile somewhere, e.g. ~/.ollama/modelfiles/dolphin-mixtral-gguf
  • Create a new ollama model based on this file:

    ollama create dolphin-mixtral-gguf -f ~/.ollama/modelfiles/dolphin-mixtral-gguf
    

  • Run this new model using ollama run dolphin-mixtral-gguf

To use this model with Langroid, specify ollama/dolphin-mixtral-gguf as the chat_model param in the OpenAIGPTConfig, as in the previous section. When a script supports it, you can also pass in the model name via -m ollama/dolphin-mixtral-gguf.
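
For example, a minimal config for this model might look like the following sketch (the chat_context_length value here is illustrative; set it to match the model you are serving):

import langroid.language_models as lm

# the "ollama/" prefix routes requests to the local Ollama server
llm_config = lm.OpenAIGPTConfig(
    chat_model="ollama/dolphin-mixtral-gguf",
    chat_context_length=32_000,  # illustrative value; adjust to your model
)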

Setup llama.cpp with a GGUF model from HuggingFace

See llama.cpp's GitHub page for build and installation instructions.

After installation, begin, as above, by downloading a GGUF model from HuggingFace; for example, the quantized Qwen2.5-Coder-7B from here (specifically, this file).

Now, the server can be started with llama-server -m qwen2.5-coder-7b-instruct-q2_k.gguf.

Alternatively, your llama.cpp may be built with support for simplified management of HuggingFace models (specifically, libcurl support is required); in that case, llama.cpp downloads HuggingFace models to a cache directory, and the server may be run with:

llama-server \
      --hf-repo Qwen/Qwen2.5-Coder-7B-Instruct-GGUF \
      --hf-file qwen2.5-coder-7b-instruct-q2_k.gguf

To use the model with Langroid, specify llamacpp/localhost:{port} as the chat_model; the default port is 8080.
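
For instance, assuming the server is running on the default port 8080, a config sketch could be:

import langroid.language_models as lm

llm_config = lm.OpenAIGPTConfig(
    chat_model="llamacpp/localhost:8080",  # llamacpp/<host>:<port> of the llama-server
    chat_context_length=32_000,  # illustrative; adjust to the model being served
)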

Setup vLLM with a model from HuggingFace

See the vLLM docs for installation and configuration options. To run a HuggingFace model with vLLM, use vllm serve, which provides an OpenAI-compatible server.

For example, to run Qwen2.5-Coder-32B, run vllm serve Qwen/Qwen2.5-Coder-32B.

If the model is not publicly available, set the environment variable HF_TOKEN to your HuggingFace token with read access to the model repo.

To use the model with Langroid, specify vllm/Qwen/Qwen2.5-Coder-32B as the chat_model and, if a port other than the default 8000 was used, set api_base to localhost:{port}.
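
For example, with the default port 8000 (so api_base can be omitted), a config sketch could be:

import langroid.language_models as lm

llm_config = lm.OpenAIGPTConfig(
    chat_model="vllm/Qwen/Qwen2.5-Coder-32B",  # "vllm/" prefix + the HuggingFace model name
)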

Setup vLLM with a GGUF model from HuggingFace

vLLM supports running quantized models from GGUF files; however, this is currently an experimental feature. To run a quantized Qwen2.5-Coder-32B, download the model from the repo, specifically this file.

The model can now be run with vllm serve qwen2.5-coder-32b-instruct-q4_0.gguf --tokenizer Qwen/Qwen2.5-Coder-32B (the tokenizer of the base model rather than the quantized model should be used).

To use the model with Langroid, specify vllm/qwen2.5-coder-32b-instruct-q4_0.gguf as the chat_model and, if a port other than the default 8000 was used, set api_base to localhost:{port}.
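
For instance, if the server was started on a non-default port such as 8001 (a hypothetical choice), the config might look like:

import langroid.language_models as lm

llm_config = lm.OpenAIGPTConfig(
    chat_model="vllm/qwen2.5-coder-32b-instruct-q4_0.gguf",
    api_base="localhost:8001",  # only needed when not using the default port 8000
)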

"Local" LLMs hosted on Groq

In this scenario, an open-source LLM (e.g. llama3.1-8b-instant) is hosted on a Groq server, which provides an OpenAI-compatible API. Using this with Langroid is exactly analogous to the Ollama scenario above: set the chat_model in the OpenAIGPTConfig to groq/<model_name>, e.g. groq/llama3.1-8b-instant. For this to work, ensure you have a GROQ_API_KEY environment variable set in your .env file. See the Groq docs.
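
For example (assuming GROQ_API_KEY is already set in your .env file), a config sketch could be:

import langroid.language_models as lm

llm_config = lm.OpenAIGPTConfig(
    chat_model="groq/llama3.1-8b-instant",  # "groq/" prefix + the Groq model name
)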

"Local" LLMs hosted on Cerebras

This works exactly like with Groq, except you set up a CEREBRAS_API_KEY environment variable, and specify the chat_model as cerebras/<model_name>, e.g. cerebras/llama3.1-8b. See the Cerebras docs for details on which LLMs are supported.

"Local" LLMs hosted on GLHF.chat

See glhf.chat for a list of available models.

To run with one of these models, set the chat_model in the OpenAIGPTConfig to "glhf/<model_name>", where model_name is hf: followed by the HuggingFace repo path, e.g. Qwen/Qwen2.5-Coder-32B-Instruct, so the full chat_model would be "glhf/hf:Qwen/Qwen2.5-Coder-32B-Instruct".
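
For example, a config sketch for this model could be:

import langroid.language_models as lm

llm_config = lm.OpenAIGPTConfig(
    chat_model="glhf/hf:Qwen/Qwen2.5-Coder-32B-Instruct",  # "glhf/" prefix + "hf:" + HF repo path
)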

Other non-OpenAI LLMs supported by LiteLLM

For other scenarios of running local/remote LLMs, it is possible that the LiteLLM library supports an "OpenAI adaptor" for these models (see their docs).

Depending on the specific model, the litellm docs may say you need to specify a model in the form <provider>/<model>, e.g. palm/chat-bison. To use the model with Langroid, simply prepend litellm/ to this string, e.g. litellm/palm/chat-bison, when you specify the chat_model in the OpenAIGPTConfig.

To use litellm, ensure you have the litellm extra installed, via pip install langroid[litellm] or equivalent.
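
For example, sticking with the palm/chat-bison model mentioned above (assuming the litellm extra is installed and the provider's API key is set up as litellm expects), a config sketch could be:

import langroid.language_models as lm

llm_config = lm.OpenAIGPTConfig(
    chat_model="litellm/palm/chat-bison",  # "litellm/" prefix + <provider>/<model>
)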

Harder: with oobabooga

Like Ollama, oobabooga/text-generation-webui provides an OpenAI-API-compatible API server, but the setup is significantly more involved. See their github page for installation and model-download instructions.

Once you have finished the installation, you can spin up the server for an LLM using something like this:

python server.py --api --model mistral-7b-instruct-v0.2.Q8_0.gguf --verbose --extensions openai --nowebui

This will show a message saying that the OpenAI-compatible API is running at http://127.0.0.1:5000.

Then in your Langroid code you can specify the LLM config using chat_model="local/127.0.0.1:5000/v1" (the v1 is the API version, which is required). As with Ollama, you can use the -m arg in many of the example scripts, e.g.

python examples/docqa/rag-local-simple.py -m local/127.0.0.1:5000/v1

Recommended: to ensure accurate chat formatting (and not use the defaults from ooba), append the appropriate HuggingFace model name to the -m arg, separated by //, e.g.

python examples/docqa/rag-local-simple.py -m local/127.0.0.1:5000/v1//mistral-instruct-v0.2
(no need to include the full model name, as long as you include enough to uniquely identify the model's chat formatting template)
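
Equivalently, in code, a config sketch for this setup (assuming the server started above, listening at 127.0.0.1:5000) could be:

import langroid.language_models as lm

llm_config = lm.OpenAIGPTConfig(
    # optionally append //<model-name-fragment> to pin the chat formatting template
    chat_model="local/127.0.0.1:5000/v1",
)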

Other local LLM scenarios

There may be scenarios where the above local/... or ollama/... syntactic shorthand does not work (e.g. when using vLLM to spin up a local LLM at an OpenAI-compatible endpoint). For these scenarios, you will have to explicitly create an instance of lm.OpenAIGPTConfig and set both the chat_model and api_base parameters. For example, suppose you are able to get responses from this endpoint using something like:

curl http://192.168.0.5:5078/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Mistral-7B-Instruct-v0.2",
        "messages": [
             {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'

To use this endpoint with Langroid, you would create an OpenAIGPTConfig like this:

import langroid.language_models as lm

llm_config = lm.OpenAIGPTConfig(
    chat_model="Mistral-7B-Instruct-v0.2",
    api_base="http://192.168.0.5:5078/v1",
)

Quick testing with local LLMs

As mentioned here, you can run many of the tests in the main langroid repo (which by default run against an OpenAI model) against a local LLM, by specifying the model as --m <model>, where <model> follows the syntax described in the previous sections. Here's an example:

pytest tests/main/test_chat_agent.py --m ollama/mixtral

Of course, bear in mind that the tests may not pass due to weaknesses of the local LLM.