Nelson's log

OpenAddress Machine job management

Our OpenAddresses system wants to run N jobs at once while downloading and processing our ~1000 sources. Mostly to do a bunch of network requests to various servers in parallel, but also for some CPU parallelism on multi-core systems. Normal jobs use very little RAM (100M?) but occasionally a big GeoJSON file blows up to 2G or 4G. Here are four options for job management.

Unix processes. I’ve been running stuff in GNU parallel, forking off a new Python process for each job. 8 simultaneous processes on an 8 CPU system. That’s worked pretty well, see previous blog posts for notes. CPU usage around 250%, memory usage all over the place but generally under 2G. Full job finishes in 3 or 4 hours.

Python threads. I just tried openaddr-process for the first time to run a full go (on the same Linux box). That code works by spawning 16 threads, so it’s one big Python process. That’s not working quite so well. Running at a steady 120% CPU; because of Python’s GIL we can’t get more parallelism. The process also has 11G resident. I suspect much of that could be reclaimed, probably left over from when 2 or 3 big GeoJSON jobs ran at once. But that stresses the garbage collector; there’s something to be said for just killing a process to get memory back. (Update: the resident set size got as big as 17G at one point, enough to force swapping. It didn’t seem to thrash though. Also Python is releasing the memory again later, the RSS has gotten as small as 9G again.)
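The threaded pattern looks roughly like this (a sketch, not the actual openaddr-process code; the source names and the `run_source` stand-in are made up). Threads overlap network I/O fine, but the GIL keeps CPU-bound work close to one core’s worth of throughput, which matches the 120% CPU figure above.

```python
# Sketch of the one-process, 16-thread approach. All threads share
# the GIL, so only the I/O waits really happen in parallel.
from concurrent.futures import ThreadPoolExecutor

def run_source(source):
    # stand-in for "download and process one source"
    return source.upper()

sources = ['us-ca-alameda', 'us-ny-nyc']  # hypothetical source names

with ThreadPoolExecutor(max_workers=16) as executor:
    results = list(executor.map(run_source, sources))
```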

Python multiprocessing. I want to try this. It should be a relatively easy change from threading, and I’m hoping it gives some CPU parallelism and memory housekeeping. Also we need a way to kill jobs asynchronously, and Python’s threading module is not designed to make that possible. With multiprocessing you just send a SIGTERM and are done.

Decoupled task architecture. openaddr-process is structured to do the job as one big long-lived Python process. Even with multiprocessing, when an individual job finishes it passes Python objects back to the task manager, which waits 3+ hours to collect them all before formatting the crucial report. This is brittle. Long term I’d like to redesign the system to be more decoupled, so that individual jobs run as completely separate processes and autonomously store state in some persistent storage: files and maybe a row or two in a database. Then the overall Machine monitor can just fork off jobs as it needs and collect data off the database. That’s much more like a task queue architecture, a robust thing. I know how to build this with my own servers, less clear how to match it to the EC2 model of cloud computing.
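The persistent-state half of that idea can be sketched in a few lines, here with SQLite standing in for “some persistent storage” (the table name and columns are invented for illustration). Each job writes its own row when it finishes; the monitor reads the table whenever it likes, so no in-memory Python objects have to survive the 3+ hour run.

```python
# Sketch: jobs record their outcome in a shared database row
# instead of returning Python objects to a long-lived manager.
import sqlite3

def record_result(db_path, source, status):
    # called by each job process as it finishes
    con = sqlite3.connect(db_path)
    con.execute('CREATE TABLE IF NOT EXISTS jobs '
                '(source TEXT PRIMARY KEY, status TEXT)')
    con.execute('INSERT OR REPLACE INTO jobs VALUES (?, ?)',
                (source, status))
    con.commit()
    con.close()

def collect_results(db_path):
    # called by the monitor whenever it wants a report
    con = sqlite3.connect(db_path)
    rows = con.execute('SELECT source, status FROM jobs').fetchall()
    con.close()
    return dict(rows)
```

Because the database is the only shared state, a crashed job just leaves a missing row, and the monitor can re-fork it without caring what happened inside the dead process.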