Setting up a local LLM to work with Langroid¶
Example scripts in the `examples/` directory

There are numerous example scripts that can be run with local LLMs in the `examples/` directory of the main `langroid` repo. These examples are also in the `langroid-examples` repo, although the latter may contain some examples that are not in the `langroid` repo.

Most of these example scripts allow you to specify an LLM in the format `-m <model>`, where the specification of `<model>` is described in the guide below for local/open LLMs, or in the Non-OpenAI LLM guide. Scripts that have the string `local` in their name have been especially designed to work with certain local LLMs, as described in the respective scripts.

If you want a pointer to a specific script that illustrates a 2-agent chat, have a look at `chat-search-assistant.py`. This script, originally designed for GPT-4/GPT-4o, works well with `llama3-70b` (tested via Groq, mentioned below).
Easiest: with Ollama¶
As of version 0.1.24, Ollama provides an OpenAI-compatible API server for the LLMs it supports, which massively simplifies running these LLMs with Langroid. For example, pulling and running a model with Ollama (e.g. `ollama run mistral:7b-instruct-v0.2-q8_0`) provides an OpenAI-compatible server for the `mistral:7b-instruct-v0.2-q8_0` model. You can then run any Langroid script using this model, by setting the `chat_model` in the `OpenAIGPTConfig` to `ollama/mistral:7b-instruct-v0.2-q8_0`, e.g.
```python
import langroid.language_models as lm
import langroid as lr

llm_config = lm.OpenAIGPTConfig(
    chat_model="ollama/mistral:7b-instruct-v0.2-q8_0",
    chat_context_length=16_000,  # adjust based on model
)
agent_config = lr.ChatAgentConfig(
    llm=llm_config,
    system_message="You are helpful but concise",
)
agent = lr.ChatAgent(agent_config)

# directly invoke the agent's llm_response method:
# response = agent.llm_response("What is the capital of Russia?")

task = lr.Task(agent, interactive=True)
task.run()  # interactive chat loop
```
Setup Ollama with a GGUF model from HuggingFace¶
Some models are not directly supported by Ollama out of the box. To serve a GGUF model with Ollama, you can download the model from HuggingFace and set up a custom Modelfile for it. E.g., download the GGUF version of `dolphin-mixtral` from HuggingFace (specifically, the file `dolphin-2.7-mixtral-8x7b.Q4_K_M.gguf`).
To set up a custom Ollama model based on this:

- Save this model at a convenient place, e.g. `~/.ollama/models/`.
- Create a Modelfile for this model. First see what an existing Modelfile for a similar model looks like (e.g. via `ollama show --modelfile dolphin-mixtral:latest`), then replace the `FROM ...` line with the path to the model you downloaded, e.g. `FROM ~/.ollama/models/dolphin-2.7-mixtral-8x7b.Q4_K_M.gguf`.
- Save this Modelfile somewhere, e.g. `~/.ollama/modelfiles/dolphin-mixtral-gguf`.
- Create a new Ollama model based on this file, e.g. `ollama create dolphin-mixtral-gguf -f ~/.ollama/modelfiles/dolphin-mixtral-gguf`.
- Run this new model using `ollama run dolphin-mixtral-gguf`.
To use this model with Langroid, you can then specify `ollama/dolphin-mixtral-gguf` as the `chat_model` param in the `OpenAIGPTConfig`, as in the previous section. When a script supports it, you can also pass in the model name via `-m ollama/dolphin-mixtral-gguf`.
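For instance, a minimal config sketch mirroring the previous section (only the model string changes):

```python
import langroid.language_models as lm

# point Langroid at the custom Ollama model created above
llm_config = lm.OpenAIGPTConfig(
    chat_model="ollama/dolphin-mixtral-gguf",
)
```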
Setup llama.cpp with a GGUF model from HuggingFace¶
See the `llama.cpp` GitHub page for build and installation instructions. After installation, begin as above by downloading a GGUF model from HuggingFace; for example, the quantized `Qwen2.5-Coder-7B` (specifically, the file `qwen2.5-coder-7b-instruct-q2_k.gguf`). The server can then be started with `llama-server -m qwen2.5-coder-7b-instruct-q2_k.gguf`.
In addition, your `llama.cpp` may be built with support for simplified management of HuggingFace models (specifically, `libcurl` support is required); in this case, `llama.cpp` will download HuggingFace models to a cache directory, and the server may be run with:

```bash
llama-server \
  --hf-repo Qwen/Qwen2.5-Coder-7B-Instruct-GGUF \
  --hf-file qwen2.5-coder-7b-instruct-q2_k.gguf
```
To use the model with Langroid, specify `llamacpp/localhost:{port}` as the `chat_model`; the default port is 8080.
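For instance, assuming the server is listening on the default port 8080 (adjust the port to match your `llama-server` invocation), a minimal config sketch looks like:

```python
import langroid.language_models as lm

# llama.cpp server assumed to be on the default port 8080
llm_config = lm.OpenAIGPTConfig(
    chat_model="llamacpp/localhost:8080",
)
```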
Setup vLLM with a model from HuggingFace¶
See the vLLM docs for installation and configuration options. To run a HuggingFace model with vLLM, use `vllm serve`, which provides an OpenAI-compatible server. For example, to run `Qwen2.5-Coder-32B`, run `vllm serve Qwen/Qwen2.5-Coder-32B`.
If the model is not publicly available, set the environment variable `HF_TOKEN` to a HuggingFace token with read access to the model repo.
To use the model with Langroid, specify `vllm/Qwen/Qwen2.5-Coder-32B` as the `chat_model` and, if a port other than the default 8000 was used, set `api_base` to `localhost:{port}`.
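For instance, a minimal config sketch (the commented-out `api_base` line is only needed for a non-default port; the port shown is an assumption):

```python
import langroid.language_models as lm

llm_config = lm.OpenAIGPTConfig(
    chat_model="vllm/Qwen/Qwen2.5-Coder-32B",
    # api_base="localhost:8001",  # only if vllm serve used a non-default port
)
```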
Setup vLLM with a GGUF model from HuggingFace¶
vLLM supports running quantized models from GGUF files; however, this is currently an experimental feature. To run a quantized `Qwen2.5-Coder-32B`, download the GGUF file from the repo (specifically, `qwen2.5-coder-32b-instruct-q4_0.gguf`). The model can then be run with `vllm serve qwen2.5-coder-32b-instruct-q4_0.gguf --tokenizer Qwen/Qwen2.5-Coder-32B` (the tokenizer of the base model, rather than the quantized model, should be used).
To use the model with Langroid, specify `vllm/qwen2.5-coder-32b-instruct-q4_0.gguf` as the `chat_model` and, if a port other than the default 8000 was used, set `api_base` to `localhost:{port}`.
"Local" LLMs hosted on Groq¶
In this scenario, an open-source LLM (e.g. `llama3.1-8b-instant`) is hosted on a Groq server, which provides an OpenAI-compatible API. Using this with Langroid is exactly analogous to the Ollama scenario above: set the `chat_model` in the `OpenAIGPTConfig` to `groq/<model_name>`, e.g. `groq/llama3.1-8b-instant`.
For this to work, ensure you have a `GROQ_API_KEY` environment variable set in your `.env` file. See the Groq docs.
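For instance, a minimal config sketch (assumes `GROQ_API_KEY` is already set as described above):

```python
import langroid.language_models as lm

# requires GROQ_API_KEY in the environment or .env file
llm_config = lm.OpenAIGPTConfig(
    chat_model="groq/llama3.1-8b-instant",
)
```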
"Local" LLMs hosted on Cerebras¶
This works exactly like the Groq setup, except you set up a `CEREBRAS_API_KEY` environment variable and specify the `chat_model` as `cerebras/<model_name>`, e.g. `cerebras/llama3.1-8b`. See the Cerebras docs for details on which LLMs are supported.
"Local" LLMs hosted on GLHF.chat¶
See glhf.chat for a list of available models. To run with one of these models, set the `chat_model` in the `OpenAIGPTConfig` to `"glhf/<model_name>"`, where `<model_name>` is `hf:` followed by the HuggingFace repo path, e.g. `Qwen/Qwen2.5-Coder-32B-Instruct`, so the full `chat_model` would be `"glhf/hf:Qwen/Qwen2.5-Coder-32B-Instruct"`.
Other non-OpenAI LLMs supported by LiteLLM¶
For other scenarios of running local/remote LLMs, it is possible that the LiteLLM library supports an "OpenAI adaptor" for these models (see their docs). Depending on the specific model, the `litellm` docs may say you need to specify a model in the form `<provider>/<model>`, e.g. `palm/chat-bison`. To use the model with Langroid, simply prepend `litellm/` to this string, e.g. `litellm/palm/chat-bison`, when you specify the `chat_model` in the `OpenAIGPTConfig`.
To use `litellm`, ensure you have the `litellm` extra installed, via `pip install langroid[litellm]` or equivalent.
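For instance, using the `palm/chat-bison` example from above, a minimal config sketch (requires the `litellm` extra and whatever credentials LiteLLM expects for that provider):

```python
import langroid.language_models as lm

# requires `pip install langroid[litellm]` plus provider credentials
llm_config = lm.OpenAIGPTConfig(
    chat_model="litellm/palm/chat-bison",
)
```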
Harder: with oobabooga¶
Like Ollama, oobabooga/text-generation-webui provides an OpenAI-API-compatible API server, but the setup is significantly more involved. See their GitHub page for installation and model-download instructions.

Once you have finished the installation, you can spin up the server for an LLM using something like this:

```bash
python server.py --api --model mistral-7b-instruct-v0.2.Q8_0.gguf --verbose --extensions openai --nowebui
```

The server will then be listening for API requests at `http://127.0.0.1:5000`.
Then in your Langroid code you can specify the LLM config using `chat_model="local/127.0.0.1:5000/v1"` (the `v1` is the API version, which is required).
As with Ollama, you can use the `-m` arg in many of the example scripts, e.g. `-m local/127.0.0.1:5000/v1`.

Recommended: to ensure accurate chat formatting (and not use the defaults from ooba), append the appropriate HuggingFace model name to the `-m` arg, separated by `//`, e.g. `-m local/127.0.0.1:5000/v1//mistral-instruct-v0.2` (no need to include the full model name, as long as you include enough to uniquely identify the model's chat formatting template).

Other local LLM scenarios¶
There may be scenarios where the above `local/...` or `ollama/...` syntactic shorthand does not work (e.g. when using vLLM to spin up a local LLM at an OpenAI-compatible endpoint). For these scenarios, you will have to explicitly create an instance of `lm.OpenAIGPTConfig` and set both the `chat_model` and `api_base` parameters.
For example, suppose you are able to get responses from this endpoint using something like:

```bash
curl http://192.168.0.5:5078/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Mistral-7B-Instruct-v0.2",
    "messages": [
      {"role": "user", "content": "Who won the world series in 2020?"}
    ]
  }'
```

Then you can set up the corresponding `OpenAIGPTConfig` like this:

```python
import langroid.language_models as lm

llm_config = lm.OpenAIGPTConfig(
    chat_model="Mistral-7B-Instruct-v0.2",
    api_base="http://192.168.0.5:5078/v1",
)
```
Quick testing with local LLMs¶
As mentioned here, you can run many of the tests in the main `langroid` repo (which by default run against an OpenAI model) against a local LLM, by specifying the model as `--m <model>`, where `<model>` follows the syntax described in the previous sections, e.g. `--m ollama/mistral:7b-instruct-v0.2-q8_0`.