I’ve been excited about the Leela Zero project, a distributed machine-learning effort that uses volunteers’ computers to learn the game of Go. I’ve been running it on my Windows box for a few days but wanted to try running it on Linux. So I leased a Google Compute server with a GPU.
The end result: it costs about $0.10 per training game on Google Compute. It might be possible to improve this by a factor of … 4? 10? … by picking a better machine type and tuning parameters. My Windows desktop with beefier hardware uses about $0.01 per game in electricity. AlphaGo Zero trained itself in under 5M games. If Leela Zero does as well, it’d cost well under $1M to train it up to superhuman strength. Details vary significantly, though: Leela Zero is so far not learning as fast, and the time to play a game will go up as the network learns the game.
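The back-of-envelope math pencils out like this (a rough sketch using this post’s figures; the 5M-game number is AlphaGo Zero’s, and Leela Zero may well need more):

```python
# Rough cost to train through 5M self-play games, at the per-game rates above.
cost_per_game_cloud = 0.10   # $/game on Google Compute (measured later in this post)
cost_per_game_home = 0.01    # $/game in electricity on my Windows desktop
games = 5_000_000            # roughly AlphaGo Zero's self-play budget

print(f"Cloud:   ${cost_per_game_cloud * games:,.0f}")
print(f"Desktop: ${cost_per_game_home * games:,.0f}")
```

So $500K renting cloud GPUs, or $50K of electricity on owned hardware, before accounting for Leela Zero learning more slowly or games getting longer.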
Here are my detailed notes. This is my first time setting up Google Cloud from scratch, and my first time doing GPU computation in Linux, so all is new to me. Nonetheless I got Leela Zero up and running in under an hour. One caveat: Google Cloud gives new users a $300 free trial. But you cannot apply that balance to a GPU machine. The cheapest GPU machines are about $0.50/hour.
Update: someone on Reddit notes you can get similar machines from Amazon EC2 at a $0.23/hour spot price. They have more CPU too, so maybe it gets down to $0.03/game?
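A quick check on that guess (hypothetical: I haven’t tried EC2; this just computes what throughput the spot price would require):

```python
# Throughput an EC2 spot instance would need to reach $0.03/game.
spot_price = 0.23    # $/hour, the quoted spot price
target = 0.03        # $/game
games_per_hour = spot_price / target
print(f"{games_per_hour:.1f} games/hour = {3600 / games_per_hour:.0f} s/game")
```

That works out to roughly 470 seconds per game, between the speeds of my two machines benchmarked below, so the guess seems plausible if the extra CPUs keep the GPU fed.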
Setting up a Google Cloud machine
The only subtle part of this is configuring a machine with a GPU.
- Create Google Cloud Account. Attach a credit card.
- Create a new Project
- Add SSH keys to the project
- Create a server
Compute Engine > VM instances
Create an instance. Choose OS (Ubuntu 17.10) and add a GPU. GPUs are only available in a few zones.
- I picked a K80, the cheap one; $0.484 / hour. The P100 is 3x the price.
- Got an error “Quota ‘NVIDIA_K80_GPUS’ exceeded. Limit: 0.0 in region us-west1.”
- Upgrade my account past the free trial.
- Try again, get same error
- Go to quota page. Find the “Edit Quotas” button. Request a single K80 GPU for us-west1. Have to provide a phone number. This seems to be a manual request that requires human approval, but it was approved in just a minute or two.
- Try a third time to set up a machine. Wish that my template machine had been saved. Works!
- Log in to the machine via IP address. It’s provisioned super fast, like 1 minute. Ubuntu had already been updated.
Setting up GPU drivers
Mostly just installing a bunch of Linux packages.
- Try seeing if I can do GPU stuff already
# apt install clinfo; clinfo
Number of platforms 0
- Figure out what hardware I have
# lspci | grep -i nvidia
00:04.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
This is a late-2014 $5000 GPU. Retails for $2500 now. It’s got 24GB of VRAM in it compared to gamer cards’ 6GB or so. It’s really two GPUs in one package, but I think I’m only allowed to use one? Probing shows 12GB of GPU memory available to me.
- Follow the instructions for installing drivers from Google.
This boils down to installing the cuda-8-0 package from an NVIDIA repo. It installs a lot of crap, including Java and a full X11 environment. It took 6 minutes and my install image at the end is about 6.2GB.
Note there are many other ways to install CUDA and OpenCL, I’m trusting the Google one. Also there’s a cuda-9-0 out now but I am following the guide instead.
- Enable persistence mode on the GPU. I have no idea what this means; isn’t it delightfully arcane?
# nvidia-smi -pm 1
Enabled persistence mode for GPU 00000000:00:04.0.
- Verify we now have GPUs
# clinfo
clinfo: /usr/local/cuda-8.0/targets/x86_64-linux/lib/libOpenCL.so.1: no version information available (required by clinfo)
Number of platforms 1
Platform Name NVIDIA CUDA
Platform Vendor NVIDIA Corporation
Platform Version OpenCL 1.2 CUDA 9.0.194
- Further optimize the GPU settings based on Google docs.
# sudo nvidia-smi -ac 2505,875
Applications clocks set to "(MEM 2505, SM 875)" for GPU 00000000:00:04.0
# nvidia-smi --auto-boost-default=DISABLED
Compiling and running Leela Zero
- Install a bunch of libraries. Took about a minute, 500 MB.
- Compile leelaz.
- Run leelaz. It shows you an empty Go board.
- At this point you’re supposed to hook up a fancy graphical Go client. But screw that, we’re hacking.
play black D3
Here’s a sample output from one move
- Compile autogtp
- Copy the leelaz binary into the autogtp directory
- Run autogtp. Here’s a sample output
Benchmarks
I didn’t benchmark carefully, but I’m including this because it’s most likely to be of general interest. Run times are highly variable: my first game on Google Cloud took 720 seconds for 270 moves; my second game lasted 512 moves (!) and took 1028 seconds. So comparing time per game over just a few games is not useful. Perhaps the ms/move numbers are comparable, or at least useful for finding optimal work settings, but even they seem highly variable in ways I can’t understand. Benchmarking this for real would take a more serious effort.
- Google Compute. 1 vCPU, K80 GPU.
1 thread: 63% GPU utilization, 800 seconds / game, 2650 ms / move
2 threads: 69% GPU utilization, 1200 s/game, 2500 ms/move.
4 threads: 80% GPU utilization, 1113 s/game, 2110 ms/move
10 threads: 57% GPU utilization, ?
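As a rough cross-check of the two metrics, the first Google Cloud game’s length and duration imply a ms/move close to the measured 1-thread figure:

```python
# First Google Cloud game: 270 moves in 720 seconds (1 thread).
moves, seconds = 270, 720
print(f"{seconds / moves * 1000:.0f} ms/move")  # prints 2667 ms/move
```

That’s within 1% of the measured 2650 ms/move, so at 1 thread the two numbers are telling a consistent story.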
- Windows desktop. i7-7700K (4 cores), GTX 1080 GPU
1 thread: 612 s/game, 1700 ms/move, 30% GPU
2 threads: 389 s/game, 1120 ms/move
3 threads: 260 s/game, 740 ms / move. 70% GPU
4 threads: 286 s/game, 701 ms / move. 84% GPU
5 threads: 291 s/game, 750 ms / move
8 threads: 400 s/game, 740 ms / move, 80% GPU
Bottom line, I’d say the Google Compute systems are roughly 800 seconds / game, or 5 games an hour. That pencils out to about $0.10 a game. My Windows box with better hardware is about 3-4 times faster. I’m guessing it uses about 200W (didn’t measure), which is about $0.08 / hour or < $0.01 / game in electricity costs.
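The arithmetic behind that bottom line (the $0.08/hour electricity figure is my guess, as above):

```python
# Cost per game for each setup, from the measurements above.
setups = [
    ("Google Compute", 0.484, 800),   # $/hour rental, seconds/game
    ("Windows desktop", 0.08, 286),   # $/hour electricity (guessed), best s/game
]
for name, dollars_per_hour, seconds_per_game in setups:
    games_per_hour = 3600 / seconds_per_game
    print(f"{name}: {games_per_hour:.1f} games/hour, "
          f"${dollars_per_hour / games_per_hour:.3f}/game")
```

That gives about $0.108/game on Google Compute and $0.006/game at home, matching the rounded figures above.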
I’m confused about the performance numbers. I think those are normalized by number of simultaneous games (the -g parameter), so lower numbers are always better. The seconds/game number is highly volatile though since game length varies so much. I guess the ms/move parameter goes down on the Linux box with more threads because we use more of the GPU? But why not the same pattern on Windows? FWIW the program author has noted Windows performance is not great.
Leelaz seems to only use 63MiB of GPU memory, so a very low RAM graphics card is probably fine.
One last thing: I’ve been running the Windows .exe binary under Windows Subsystem for Linux. 4 threads in this environment is 710 ms/move; 4 threads in a DOS window is 777 ms/move. Not sure it’s significant, but seemed worth noting.