After a couple of weeks of steady work the Python code to process OpenAddresses is nearly feature complete. I did a big run last night over all the sources we have. It went well!
- 608 input sources: 402 out.csv files, 372 sample.json files
- 80M out.csv rows. (Compare 100M from the Node code)
- 3 hours, 5 minutes running time
- < 16G RAM running 8x parallel
- 11.5G of useful output files
- 100+G disk used (temp files left on error)
- 7G of cached source data: 3.5G zip, 3.5G json
- 150Wh = $0.50 of electricity
I ran the whole thing with 8x parallelism using GNU Parallel on my Linux box. The box is nothing special: a quad-core i7 with 16G of RAM and a very ordinary 7200RPM hard drive. Home network is 100Mbit for downloads.
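In Python itself the same 8-wide fan-out could be done with `concurrent.futures` instead of GNU Parallel. This is just an illustrative sketch, not the actual OpenAddresses code; `process_source` and the source filenames here are hypothetical placeholders for the real per-source conversion step.

```python
# Sketch: process many sources 8 at a time, similar in spirit to
# running the converter under GNU Parallel with -j 8.
from concurrent.futures import ProcessPoolExecutor

def process_source(path):
    # Placeholder for the real work: download, convert, write out.csv
    return f"processed {path}"

# Hypothetical list of source files
sources = [f"source-{i}.json" for i in range(32)]

with ProcessPoolExecutor(max_workers=8) as pool:
    # map() preserves input order even though jobs run in parallel
    results = list(pool.map(process_source, sources))

print(f"{len(results)} sources done")
```

One practical difference: GNU Parallel makes it easy to retry or resume individual failed jobs from the shell, which matters when some sources error out and leave temp files behind.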
I think disk contention was the primary bottleneck; an SSD might make a big difference. So would keeping data in RAM while we process it; I think I overwhelmed the disk cache. FWIW the disk has a benchmarked performance of about 120 MBytes/second. But having listened to the thing, there was an awful lot of seeking going on. Parallelism isn't friends with a single spinning drive.
Below are some munin graphs for the day of the run. There are two runs here: an aborted run to the disk sda (which filled), then a successful run from about 04:30 to 07:30 to the disk sdb. I'm not sure the graphs have a whole lot of information, but they do show some serious IO contention; the CPU graph shows as much time spent in IO wait as doing actual work, and disk latencies hit 500ms for a while there, ugh. The memory graph is too messy to read, but I do know the system never needed to swap.