# Using Langroid with Local LLMs
## Why local models?
There are commercial, remotely served models that currently appear to beat all open/local models. So why care about local models? Local models are exciting for a number of reasons:
- cost: other than compute/electricity, there is no cost to use them.
- privacy: no concerns about sending your data to a remote server.
- latency: no network latency due to remote API calls, so faster response times, provided you can get fast enough inference.
- uncensored: some local models are not censored to avoid sensitive topics.
- fine-tunable: you can fine-tune them on private/recent data, which current commercial models don't have access to.
- sheer thrill: having a model running on your machine with no internet connection, and being able to have an intelligent conversation with it -- there is something almost magical about it.
The main appeal of local models is that, with careful prompting, they may behave well enough to be useful for specific tasks/domains, while bringing all of the above benefits. Some ideas on how you might use local LLMs:
- In a multi-agent system, you could have some agents use local models for narrow tasks with a lower bar for accuracy (and fix responses with multiple tries).
- You could run many instances of the same or different models and combine their responses.
- Local LLMs can act as a privacy layer, to identify and handle sensitive data before passing it to remote LLMs.
- Some local LLMs have intriguing features, for example llama.cpp lets you constrain its output using a grammar.
## Running LLMs locally
There are several ways to use LLMs locally. See the r/LocalLLaMA subreddit for a wealth of information. There are open-source libraries that offer front-ends to run local models, for example oobabooga/text-generation-webui (or "ooba-TGW" for short). The focus in this tutorial, however, is on spinning up a server that mimics an OpenAI-like API, so that any code that works with the OpenAI API (for, say, GPT-3.5 or GPT-4) will work with a local model, with just one simple change: set `openai.api_base` to the URL where the local API server is listening, typically `http://localhost:8000/v1`.
There are a few libraries we recommend for setting up local models with OpenAI-like APIs:
- LiteLLM OpenAI Proxy Server lets you set up a local proxy server for 100+ LLM providers (remote and local).
- ooba-TGW, mentioned above, for a variety of models, including llama2 models.
- llama-cpp-python (LCP for short), specifically for llama2 models.
- ollama, which makes it easy to download and run a variety of local models.
We recommend visiting these links to see how to install and run these libraries.
## Use the local model with the OpenAI library
Once you have a server running using any of the above methods, your code that works with the OpenAI models can be made to work with the local model by simply changing `openai.api_base` to the URL where the local server is listening.
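For example, with the `openai` Python package (versions before 1.0, which expose the module-level `openai.api_base` setting; in versions 1.0+ you would instead pass `base_url` when creating the client), a minimal sketch might look like the following. The model name and API key are placeholders: most local servers ignore the model name or map it to whatever model they have loaded, and accept any non-empty key.

```python
import openai

# Point the openai library at the local server instead of api.openai.com.
openai.api_base = "http://localhost:8000/v1"
openai.api_key = "sk-unused"  # local servers typically accept any non-empty key

# "local-model" is a placeholder; many local servers ignore the model name
# or map it to the model they are currently serving.
response = openai.ChatCompletion.create(
    model="local-model",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response["choices"][0]["message"]["content"])
```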
If you are using Langroid to build LLM applications, the framework takes care of the `api_base` setting in most cases, and you only need to set the `chat_model` parameter in the LLM config object for the LLM you are using. See the Non-OpenAI LLM tutorial for more details.
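As a rough sketch, assuming Langroid's `OpenAIGPTConfig` as the LLM config object and a generic OpenAI-compatible server listening at `localhost:8000/v1` (the exact `chat_model` string depends on how your local model is served, and the context length shown is just an example -- see the Non-OpenAI LLM tutorial for the precise syntax for your setup):

```python
import langroid as lr
import langroid.language_models as lm

# Example config for a local OpenAI-compatible server at localhost:8000/v1;
# the chat_model string and context length are assumptions for illustration.
llm_config = lm.OpenAIGPTConfig(
    chat_model="local/localhost:8000/v1",
    chat_context_length=4096,  # set to your model's actual context window
)

agent = lr.ChatAgent(lr.ChatAgentConfig(llm=llm_config))
response = agent.llm_response("What is the capital of France?")
```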