Years ago I picked up some Unix wisdom from a greybeard: ignore the Unix “load average”. It’s just not a very precise measure of how busy a system is. That was particularly true on the Ultrix 2.2 systems I first learned on. The stupid thing counted zombie processes when calculating the load, and zombies take essentially no resources.
You’re always better off looking more closely at contention for specific resources in your system: CPU, RAM, disk I/O, network I/O. But the load average is one handy little number. What does it mean? I’ll focus on the first number, the 1-minute average. (The three numbers are exponentially decaying moving averages over 1, 5, and 15 minute windows; I find the decay math a bit opaque.)
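The decay is less mysterious than it looks. Here’s a minimal sketch of the 1-minute average’s update rule, assuming the kernel’s 5-second sampling interval (the fixed-point constants in kernel/sched/loadavg.c approximate the same exponential factor):

```python
import math

# Every 5 seconds the kernel folds the running average toward the
# current count of active tasks, with a 1-minute decay constant.
SAMPLE_INTERVAL = 5.0                       # seconds between samples
DECAY = math.exp(-SAMPLE_INTERVAL / 60.0)   # 1-minute time constant

def update_loadavg(old_avg, active_tasks):
    """One 5-second update step of the exponential moving average."""
    return old_avg * DECAY + active_tasks * (1.0 - DECAY)

# Example: start at 0 and keep 8 tasks active. After one minute the
# average has climbed to about 63% of the way toward 8 (~5.06).
avg = 0.0
for _ in range(12):   # 12 samples x 5s = one minute
    avg = update_loadavg(avg, 8)
```

This is why the load average lags reality: a sudden burst of work takes a minute or more to fully show up in the number, and a minute or more to drain back out.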
The official docs say load average is “the number of jobs in the run queue (state R) or waiting for disk I/O (state D)”. So it’s a measure of CPU and disk contention. The kernel source has a remarkably clear comment: the load is the number of tasks that are either running or in uninterruptible sleep. That means it counts tasks actually tied up doing work, not CPUs. I don’t really understand uninterruptible sleep in the kernel other than that it mostly happens while waiting for disk, so it’s not really a measure of disk bandwidth so much as disk wait. I also think this description means that time spent in the kernel itself (sys) counts towards the load average.
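To see what the kernel is counting, here’s a rough Linux-only sketch that tallies tasks currently in state R or D by scanning /proc; the helper name is my own, and the count races with processes exiting mid-scan:

```python
import glob

def count_active_tasks():
    """Count tasks in state R (running) or D (uninterruptible sleep),
    roughly the instantaneous quantity the load average tracks."""
    active = 0
    for stat in glob.glob('/proc/[0-9]*/stat'):
        try:
            with open(stat) as f:
                # /proc/PID/stat looks like: "pid (comm) S ..." where the
                # field after the parenthesized command name is the state.
                state = f.read().rsplit(')', 1)[1].split()[0]
        except OSError:
            continue  # process exited while we were scanning
        if state in ('R', 'D'):
            active += 1
    return active
```

Run it a few times in a row and you’ll see the instantaneous count bounce around; the load average is just a smoothed version of this number.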
So load average is a good measure of whether your system has enough CPUs to do the work at hand. Aiming to keep load_avg = num_cpus as a long-term average should maximize CPU throughput.
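A quick way to check that rule of thumb on a Linux or macOS box; `os.getloadavg` is a thin wrapper over the same numbers uptime prints:

```python
import os

# Compare the 1-minute load average against the CPU count.
load1, load5, load15 = os.getloadavg()
ncpus = os.cpu_count()

if load1 > ncpus:
    print(f"load {load1:.2f} exceeds {ncpus} CPUs: work is queueing")
else:
    print(f"load {load1:.2f} is within {ncpus} CPUs: headroom available")
```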
However a lot of our real work is network-bound, either Internet downloads or waiting on the local network for something like a database to answer. So for many kinds of work we want to run more simultaneous jobs than num_cpus. Conveniently, a job waiting on the network contributes close to 0 to the load average, so even when you oversubscribe the CPUs with network-bound jobs, aiming for load_avg = num_cpus is still not a bad heuristic.
It’s still useful to look more closely at detailed system resource usage metrics. htop and vmstat are great for inspecting memory usage, I like iftop for network, and iostat is the traditional tool for disk I/O.
I’m running an OpenAddresses run right now with 32 processes on an 8 CPU machine. There’s a lot of network waiting, but even so num_cpus * 4 seems to be too many jobs: the load average has been about 20. In previous runs with 16 processes the load average was about 8–10.
Some useful extra reading: