Running my own LLM

No original work here, just summarizing how astonishingly easy it is to install and run an LLM on your own computer using Simon Willison’s fantastic llm tool. Simon has been on an absolute tear with tooling for LLMs lately, amazing work. A bit hard to keep up but between his Mastodon feed and his blog I can sort of follow along. See his list of plugins for a sense of how broad the tool is.

I think this current moment in AI development is amazing and fascinating. And it’s great so many models can be downloaded for free experimentation. Simon’s tool makes this easily accessible.

Getting started

Here’s what I did to host my own LLM in just a few minutes:

  1. pip install llm
  2. llm install llm-gpt4all
  3. time llm -m orca-mini-3b 'what is the capital of france'
    The capital city of France is Paris.
    real 0m2.337s user 0m7.701s sys 0m0.168s

That’s it, literally three commands. All automagic. The slowest part is downloading the 1.8GB of model weights for Orca. Simon did a fantastic job with this tool. See the detailed docs or read on for my usage notes.

Technical details

llm was originally designed as a front-end for the OpenAI API, but it has a plugin architecture with lots of plugins. There are plugins for other hosted models like Google PaLM, Replicate, and Claude. But I’m more interested in running the model locally. My example above uses llm-gpt4all, which downloads and runs models from the GPT4All collection in a standard quantized format. (There are some 13 models there; most are about 7GB big.) There are also plugins for locally hosting Llama, MLC, and Mosaic’s MPT models.
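
A quick sketch of the plugin workflow, using llm’s own subcommands (which models you see depends on what you’ve installed):

    llm plugins        # list which plugins are installed
    llm models list    # list every model those plugins expose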

I installed this on my year-old Linux server. It’s middle-of-the-road home system hardware, notably with no external GPU. So it’s not a great match for AI, but then I don’t have to sweat the details of making GPU access work on Linux. When running the GPT models they seem to use 6 or 8 threads, which suggests they want real cores and not hyperthreads. Not sure how much RAM it uses; the docs say 4GB for the 1.8GB Orca model, but I’m observing more like 1GB.

The smallest Orca model is fast enough on my CPU to be interactively usable: 2-3 seconds for short queries. Some others are significantly slower; I have to think a lot of these are designed to run on a GPU.

The main command is llm prompt, for querying the LLM. It has a -c flag to continue the most recent conversation, which gets you something like ChatGPT’s web interface. Otherwise each question starts from a blank slate.
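
For example (the follow-up question here is just made up to show the flag):

    llm 'what is the capital of france'
    llm -c 'what is its population'    # -c continues the previous conversation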

Each model comes with a system prompt baked in that primes the LLM for your query, e.g. “You are an AI assistant that follows instruction extremely well. Help as much as you can.” You can customize this; for instance Simon advises telling Llama “you are funny” if you want jokes.
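
You can also override the system prompt per query with the -s/--system flag; a quick sketch, reusing Simon’s “you are funny” suggestion:

    llm -m llama-2-7b-chat -s 'You are funny' 'what is the capital of france'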

I’m not sure exactly where llm stores its model files; they aren’t in the Python venv. Some of mine are in ~/.config/io.datasette.llm, others in ~/.cache/gpt4all. These model files are huge, often over 10GB each, so beware if your home directory is backed up over a network or something.
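
A quick way to see how much space they’re taking, using the two directories I found on my system:

    du -sh ~/.config/io.datasette.llm ~/.cache/gpt4all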

In summary, Simon’s llm tool is a great front-end driver for various LLMs. It handles installing plugins to drive the models, downloading the models themselves (for local hosting), and key management (for remotely hosted AIs). It also has some nice features for tracking your usage, like llm logs for access to a database of your chat history.
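
Roughly what those commands look like:

    llm keys set openai    # store an API key for a remotely hosted model
    llm logs               # recent prompts and responses from the history database
    llm logs path          # where that SQLite database lives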

Some model outputs

Some quick notes from me trying all the self-hosted models with the test query “what is the capital of france”. Note I tried several versions of Llama 2 using different llm plugins. All timings are on a plain Intel CPU; results should be much faster with a working GPU.

gpt4all results

These are all using the gpt4all plugin, which seems to run best on my system. All of these give a short output of “The capital city of France is Paris.” or something very similar.

  • orca-mini-3b. 2.4s.
  • orca-mini-7b. 4.2s.
  • orca-mini-13b. 7.6s.
  • ggml-model-gpt4all-falcon-q4_0. 3.9s.
  • wizardlm-13b-v1. 5s (also uncensored).
  • nous-hermes-13b. 5s.
  • GPT4All-13B-snoozy. 5s.
  • llama-2-7b-chat. 10s.

One reason these are so fast is that the output is so short, presumably an artifact of how the LLM was set up by the plugin. (The prompt template is “Human: {{{user input here}}}. Assistant:”.) If I ask orca-mini-3b “what is the capital of france and tell me a bit about it” it generates a full paragraph and takes 9 seconds instead of 2.6.
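
In command form, that longer query is just the same timing invocation as before:

    time llm -m orca-mini-3b 'what is the capital of france and tell me a bit about it'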

The capital city of France is Paris, also known as “The City of Lights”. It is located in the northern central part of France and is the country’s largest city with over 2.1 million people living there. Paris is famous for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, the Louvre Museum, and the Arc de Triomphe. The city is also known for its fashion industry, art galleries, restaurants, cafes, and shopping centers. Paris is a popular tourist destination and attracts millions of visitors every year.

Other LLM plugins

The other plugins run slower, but that might be because they are giving larger outputs.

  • MLC llama-2-7b. 26 seconds. Something weird in htop, it doesn’t look like it’s using any CPU?!
    Thank you for asking! The capital of France is Paris. I’m glad you asked, as it’s a beautiful and vibrant city that offers a lot of interesting places to visit and things to do. If you have any other questions or need further assistance, feel free to ask!
  • llama-cpp llama-2-7b-chat.ggmlv3.q8_0. 22 seconds.
    The capital of France is Paris. It is located in the northern central part of the country and is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum.
  • mpt-30b. 75 seconds.
    The capital of France is Paris. It is located in the northern central part of the country and is known for its famous landmarks such as the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. With a population of over 2 million people, Paris is one of the most populous cities in Europe. It is also known for its art, fashion, cuisine, and cultural heritage, making it one of the most popular tourist destinations in the world.
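
For reference, each of those came from its own plugin. Installing them follows the same pattern as llm-gpt4all (these are the plugin names as I understand them; most also need an extra step to download weights, see each plugin’s README):

    llm install llm-mlc          # MLC-compiled Llama 2
    llm install llm-llama-cpp    # llama.cpp GGML models
    llm install llm-mpt30b       # MosaicML MPT-30B
    llm models list              # confirm the new models show up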

Specialist output

These models give specialized output rather than general chat.

  • gpt4all Bert (ggml-all-MiniLM-L6-v2-f16). 0.8s. I think it’s an embedding model, not intended for direct prompting.
    !!!!!!!!!!!!!!!!!!!…
  • gpt4all starcoderbase-7b-ggml. 34s. One of several models specifically for generating code.
    def get_capital(country): …
  • markov. 2.4s. This is the simple Markov chain from Simon’s plugin-writing tutorial (llm-markov), a demo of how to write a plugin, not a serious tool for use.
    is the capital of france france the capital of france what is the capital of france the capital of france…

Daily usage

Now when I want to ask an AI a question I’ll simply do llm 'my question here'. No more thought required than that; answers come back in about 10 seconds depending on the complexity.

I went ahead and ran llm models default wizardLM-13B-Uncensored to pick a specific default model. Why that one? Simon’s notes say Wizard is the best choice, and it ran fast enough for me. Not sure exactly what “uncensored” means here, but it seemed like a novelty I could try that’s not available from the online services. I suspect it isn’t adding anything of value though.
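
So daily use boils down to two commands, the second being whatever I want to ask:

    llm models default wizardLM-13B-Uncensored    # set the default once
    llm 'my question here'                        # then just ask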