Nelson's log

GNU parallel for openaddr-process-one

I’ve been running batches of OpenAddresses conversion jobs for bulk testing with a simple shell for loop:


for f in us-ca-*.json; do
  openaddr-process-one "$f" /tmp/oa
done

But it’s better to do this in parallel, particularly since the work is a nice mix of CPU, network, and disk IO. Below is my first ever GNU parallel script to do that.

How much parallel? My system has 16 gigs of RAM and 4 physical CPU cores (which look like 8 because of hyperthreading). I originally tried running “--jobs 200%”, but because of the hyperthreading that results in 16 jobs, which feels like a lot. So I’m keeping it at 8.

With 8 jobs in parallel we process 29 files in about 14 minutes. Running that in serial took 48 minutes. Los Angeles County alone takes 11 minutes and Solano County takes 10, so a total of 14 minutes to run 29 files sounds pretty good to me.

us-ca-los_angeles_county downloads fast; all its time is spent in my Python conform code. There are 3M rows, so it’s no wonder it’s slow, but that still seems too slow. A good place to focus some code optimization effort.

Solano County takes 10 minutes to download. It’s one of those ESRI webservice things: 391 requests for 500 records at a time, and each one takes 1 or 2 seconds. I’ve been thinking we ought to have a separate pre-caching process for the ESRI sources, storing the ripped GeoJSON data in S3 somewhere. There’s no way to test for data changes on the ESRI services, so we can’t cache if we go direct; this way we could insulate the rest of the machine from the slowness of the services and the caching problems.
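The paging loop itself is simple. Here’s a sketch of the drain pattern, with the actual HTTP request abstracted into a query_page callable (the real thing would hit the service’s query endpoint with resultOffset/resultRecordCount parameters; all names here are made up):

```python
def fetch_all_features(query_page, page_size=500):
    """Drain a paged ESRI-style feature service.

    query_page(offset, count) returns a list of feature dicts; a short
    or empty page means we've reached the end. For Solano this loop
    runs ~391 times, which is why the download takes 10 minutes.
    """
    features = []
    offset = 0
    while True:
        page = query_page(offset, page_size)
        if not page:
            break
        features.extend(page)
        if len(page) < page_size:
            break  # short page: no more records
        offset += page_size
    return features
```

Note that as written this accumulates every feature in one list, which may be exactly the 1G+ RAM problem mentioned below; a generator that yields features one page at a time would keep memory flat.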

We wrote the code to be careful about RAM, processing everything as streams. Most jobs take a polite 70M to run, but us-ca-kern, us-ca-marin_county, and us-ca-solano all mapped 1G+. Those are all ESRI-type services; maybe that code path is buffering all the data in RAM? Need to look more closely.

I suspect the main point of contention for running in parallel is disk IO. There’s a lot of file writing and copying; my own conform code basically writes the whole dataset twice. In theory I could get that down to once but it makes the code significantly more complicated.

Running this in parallel like this makes me realize there’s still work to do on the logging configuration. I want to be able to say “log only to a file named on the command line, with a configuration specified in a file named here”; my logging config code doesn’t allow that. For now I can use the command line config to log to a file, then just shove all of stderr to /dev/null. It’d be nicer to ask parallel to drop stderr from the jobs it’s running, but I couldn’t figure out how to do that. Even better would be to make it so I can configure the Python logging the way I want.


#!/bin/bash

# Quickie script to run a bunch of openaddr-process-one jobs in parallel

srcdir=/home/nelson/src/oa/openaddresses-me/sources
out=/tmp/oa
sources='us-ca-*.json'

# Only try to process sources with a conform
sources=$(cd "$srcdir"; grep -l 'conform' $sources)
sources=$(echo "$sources" | sed 's/\.json$//')
mkdir -p "$out"

# Echo a bunch of commands to run; these could be piped in to shell to run sequentially
#for s in $sources; do
#    echo openaddr-process-one -v -l "$out/$s.log" "$srcdir/$s.json" "$out"
#done

start=$(date +%s)
parallel -u --jobs 8 openaddr-process-one -v -l "$out/{}.log" "$srcdir/{}.json" "$out" ::: $sources 2> /dev/null
end=$(date +%s)
echo "$(echo $sources | wc -w) jobs run in $((end-start)) seconds"