Some random home-scale LLM facts

I’ve been learning a lot about small-scale LLMs running on your desktop by reading /r/localllama. None of this is novel, just absorbing common wisdom. I’m linking to some sources but these are by no means authoritative, just useful for feels.

  • They tend to talk about models in terms of number of parameters.
  • LLaMa 2, which many models are derived from, comes in 7B, 13B, and 70B sizes. 70B means 70 billion 16-bit weights. The models were trained on 2 trillion tokens of input.
  • Number of parameters dictates memory needed: LLaMa 2’s 70B at 16 bits is 140GB. But folks often quantize down to 4 bits when running them, so it’s just 35GB. (Or less? See the napkin math after this list.)
  • Desktop CPU RAM is fairly cheap (consumer boards take up to 128GB), but VRAM is expensive if you’re running on a GPU. The most expensive gaming GPU (an Nvidia 4090 at ~$2000) has 24GB of VRAM.
  • You can also chain multiple cheaper GPUs to get more total VRAM. Beware that a single 3090 is three slots wide, so you need the right motherboard and case to have room for two.
  • Specialty AI chips like the A100 or H100 support 40-100ish GB per GPU.
  • 70B is said to be enough to approach GPT-4 levels of quality, but I have my doubts.
  • The models I’m running on my CPU with GPT4All are 6-13B.
  • GPT4All uses GGML / llama.cpp to run things. It has various quantization algorithms; it’s not just “truncate the lower bits”. The library targets 4-, 5-, and 8-bit integer quantization.
  • GGML can offload part of the model to run on the GPU and use the CPU for the rest. This lets you run bigger models than can fit on the GPU alone; there’s a minimal offload sketch after this list.
  • This discussion about folks running LLMs on an Nvidia 4090 with an Intel i9 is interesting. That’s the highest-end reasonable home hardware, maybe $4000ish for a full system? They talk about running 70B models at 1-2 tokens / second, which is not fast but maybe acceptable for an interactive session.
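
A quick back-of-the-envelope sketch in Python for the memory numbers above. It only counts the weights; real memory use runs higher because of the KV cache and runtime overhead, which is roughly why the update below says a 4-bit 70B wants 48GB rather than 35GB.

```python
# Napkin math: model memory is roughly (parameter count) x (bits per weight).
# This ignores the KV cache and runtime overhead, so real use is somewhat higher.

def model_size_gb(params_billions: float, bits_per_weight: int) -> float:
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal GB is close enough for napkin math

for params in (7, 13, 70):
    for bits in (16, 8, 4):
        print(f"{params}B at {bits}-bit: {model_size_gb(params, bits):.0f} GB")

# 70B at 16-bit -> 140 GB, 70B at 4-bit -> 35 GB, matching the numbers above.
```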
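
And a minimal sketch of the partial GPU offload GGML / llama.cpp supports, using the llama-cpp-python bindings. The model filename and layer count here are placeholder assumptions, not recommendations; the point is just that n_gpu_layers decides how much of the model lives in VRAM versus system RAM.

```python
# Sketch of partial GPU offload via the llama-cpp-python bindings, which wrap
# llama.cpp / GGML. The model path and numbers below are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-2-13b.Q4_K_M.gguf",  # a 4-bit quantized model file
    n_gpu_layers=32,  # layers to push to VRAM; the rest run on the CPU
    n_ctx=2048,       # context window size
)

out = llm("Q: What can I run on 24GB of VRAM? A:", max_tokens=64)
print(out["choices"][0]["text"])
```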

Update

Some new facts:

  • Running a 70B model (at 4-bit quant?) at home takes 48GB of VRAM. Folks are using two high-end GPUs like 3090s or 4090s. There’s talk of generating ~15 tokens / second. At the low end, two Tesla P40s will work and are cheap.
  • Running on your own leased cloud computer is totally reasonable. See this discussion and this one. It may cost about $1500 a year to lease, compared to $4000+ to buy hardware.
  • Inference can use up to 700W of GPU power; call it another 100W for the rest of the computer and you get to about 35¢ an hour just for PG&E electricity (napkin math after this list). OTOH in casual use you’re not running inference full-out, so actual power draw will be much less.
  • RunPod keeps coming up in discussions about cloud providers.
  • Folks say it costs about 60¢ an hour to run a 70B model on a cloud server. Apparently this can be done for a single user without too much drama. Scaling that up to 100+ users is going to run into availability problems.
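
Napkin math for the cost claims in this list. The electricity rate and hours-per-day figures are my own assumptions for illustration, not numbers from the linked discussions.

```python
# Rough running-cost arithmetic. The electricity rate and usage hours are
# assumptions for illustration; the 60 cents/hour cloud figure is from above.

GPU_WATTS = 700       # inference running the GPU(s) flat out
SYSTEM_WATTS = 100    # the rest of the box
RATE_PER_KWH = 0.43   # assumed PG&E-ish residential rate, $/kWh

home_per_hour = (GPU_WATTS + SYSTEM_WATTS) / 1000 * RATE_PER_KWH
print(f"home electricity: ${home_per_hour:.2f}/hour")  # ~$0.34, about 35 cents

CLOUD_RATE = 0.60     # $/hour for a 70B-capable cloud GPU
HOURS_PER_DAY = 7     # assumed interactive use per day

cloud_per_year = CLOUD_RATE * HOURS_PER_DAY * 365
print(f"cloud at {HOURS_PER_DAY}h/day: ${cloud_per_year:.0f}/year")  # ~$1500
```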

One thought on “Some random home-scale LLM facts”

  1. Hey Nelson, feel free to drop a message (I’m in the r/ll discord more than the subreddit) if you have any questions. For 2 x 4090s you should be getting 20+ t/s, but you can get 12-15 w/ 2 x 3090s for half the price (or just spin up a cloud GPU for $2-4/hr and probably come out even cheaper).
