OpenAddress Machine mark 3

This post is a work in progress.

Motivation

When I’m back from India in late February I’d like to redesign OpenAddress Machine to a more decoupled architecture. Right now the system is one big Python process that runs every source from download all the way to output CSV. (There is some state carried over from run to run, but not much.) It works pretty well and is simple, and a full run takes just 2–3 hours. But the multiprocessing code is brittle, and the whole system is both overkill (most sources never change, so why reprocess them?) and unresponsive (there’s no fast way to test a new source).

Proposal

Move to a decoupled architecture where tasks run as fully independent Unix processes. They share state via some centralized database / cloud storage thingy, and also post results to S3 for web serving.

As in the current architecture, the unit of work in the job system is a single source such as us-ca-san_francisco. Each job runs as an independent Unix process with no awareness of other sources: it downloads, conforms, summarizes, reports, and uploads its results to the database and to S3.

We also need a task queuer, something that decides when to run source jobs and posts them to the task queue. That process works mostly by reading the database and deciding what work needs doing, following rules like: “process a new source immediately”; “check a source weekly for new data and reprocess it”; “clear the whole queue and reprocess everything”.
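As a minimal sketch of the first two rules, assuming we track each source’s last finished run in the database (the names here are made up, not a settled design):

    import datetime

    WEEK = datetime.timedelta(days=7)

    def needs_run(last_finished, now=None):
        """Decide whether a source should be queued.

        Covers the routine rules: a source with no completed run is
        queued immediately; an existing source is re-checked weekly.
        ("Clear the whole queue and reprocess" just queues everything.)
        """
        now = now or datetime.datetime.utcnow()
        if last_finished is None:       # brand new source: process it now
            return True
        return now - last_finished >= WEEK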

Finally we need a task executor that watches the queue and launches tasks as needed. It could be part of the task queuer, but it helps to think of it as a separate entity. When a task finishes, the executor writes the result to the task record log.
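A sketch of the executor loop, assuming a claim_next_task() like the one in the queue sketch further down, plus a hypothetical command name and result writer:

    import subprocess
    import time

    def run_executor(conn):
        """Poll the queue, launch tasks, and log results.

        A toy single-task loop; the real executor would run several
        tasks in parallel, possibly on separate machines.
        """
        while True:
            source = claim_next_task(conn)     # see the queue sketch below
            if source is None:
                time.sleep(30)                 # queue empty; poll again
                continue
            # 'openaddr-process-one' stands in for whatever the task
            # program ends up being called.
            returncode = subprocess.call(['openaddr-process-one',
                                          source + '.json'])
            record_result(conn, source, returncode == 0)  # hypothetical writer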

Database implementation

I’m totally agnostic about the data store we use to track runs. It’s a very low-demand system: just a few hundred megabytes of data and a handful of transactions a minute. It does need to be consistent and responsive, though. The simple choice is a PostGIS database running persistently on some 24/7 server. I’m not sure what the cloud-services equivalent would be, so running with the PostGIS idea, here are some schema sketches, each followed by a rough code sketch:

Completed task records

  • Task ID
  • Time started
  • Time finished
  • Source name
  • Success / failure
  • S3 output paths: out.csv result, debug logs, etc.
  • Whatever else write_state does.
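In PostgreSQL terms that might look something like the following; the column names are just a guess at the shape, sketched with psycopg2:

    import psycopg2

    CREATE_TASK_RECORD = """
    CREATE TABLE IF NOT EXISTS task_record (
        task_id     serial PRIMARY KEY,
        source_name text NOT NULL,     -- e.g. us-ca-san_francisco
        started_at  timestamptz NOT NULL,
        finished_at timestamptz,
        succeeded   boolean,
        out_csv_url text,              -- S3 path to the out.csv result
        log_url     text,              -- S3 path to debug logs
        state       json               -- whatever else write_state captures
    )
    """

    def init_tables(dsn):
        """Create the completed-task table (idempotent)."""
        with psycopg2.connect(dsn) as conn:
            conn.cursor().execute(CREATE_TASK_RECORD)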

Task queue

  • There are a zillion ways to do a task queue in a database. Rows are inserted by the task queuer, then removed by the task executor when they’re dispatched.
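One simple version, again sketched with psycopg2; the row-locking details are an assumption about what we’d want, not a settled design:

    import psycopg2

    CREATE_TASK_QUEUE = """
    CREATE TABLE IF NOT EXISTS task_queue (
        id          serial PRIMARY KEY,
        source_name text NOT NULL,
        queued_at   timestamptz NOT NULL DEFAULT now()
    )
    """

    def claim_next_task(conn):
        """Pop the oldest queued source, or return None if the queue is empty.

        SELECT ... FOR UPDATE locks the row so two executors can't grab
        the same task; the DELETE removes it once it's dispatched.
        """
        with conn.cursor() as cur:
            cur.execute("""SELECT id, source_name FROM task_queue
                           ORDER BY queued_at LIMIT 1 FOR UPDATE""")
            row = cur.fetchone()
            if row is None:
                conn.rollback()
                return None
            cur.execute("DELETE FROM task_queue WHERE id = %s", (row[0],))
            conn.commit()
            return row[1]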

Task implementation

A task should be a single Python program. It takes a source.json specification file plus any job metadata that’s needed (hopefully none?) and executes the job. For debugging, it should be possible to run this program standalone on a developer’s machine with no dependencies on the queuing system, the task manager database, or S3.
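A minimal sketch of the command-line shape, where process_one() is a hypothetical stand-in for the actual download/conform/report pipeline:

    import argparse
    import json
    import sys

    def main():
        parser = argparse.ArgumentParser(description='Process one source, standalone.')
        parser.add_argument('source_json',
                            help='source spec, e.g. us-ca-san_francisco.json')
        parser.add_argument('--output-dir', default='.',
                            help='write results locally; no S3 or database needed')
        args = parser.parse_args()

        with open(args.source_json) as file:
            source = json.load(file)

        # process_one() is hypothetical: the pipeline entry point.
        ok = process_one(source, args.output_dir)
        sys.exit(0 if ok else 1)

    if __name__ == '__main__':
        main()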

These jobs are a natural fit for cheap EC2 spot instances. Right now tasks take 0.5–90 minutes to run; it’s a bit wasteful to spin up a whole EC2 instance for 30 seconds of work, but maybe that’s OK. We could also run tasks on a single workhorse machine, which is effectively what we’re doing now: a single mid-level Linux box can run 16 tasks in parallel. (In fact parallelism is good; the jobs are a mix of load types.)

Alternate: subtasks

This proposal assumes a whole source is run start to finish as a single Python process. But processing a source actually consists of several subtasks, and it might be good to run them as separate jobs. Here’s what the subtasks are (a pipeline sketch follows the list):

  • Download from source URL / ESRI store
  • Sample the source data for user inspection
  • Convert the source JSON/CSV/SHP data to an extracted.csv file
  • Conform the extracted.csv data to OpenAddresses out.csv format
  • Upload out.csv and job data to S3
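Sketched as a pipeline with file handoffs between steps; every function here is hypothetical:

    def run_source(source, workdir):
        """Run the five subtasks in order, passing files between them.

        Each step could just as easily be its own queued task, with the
        file paths recorded in the database instead of passed in memory.
        """
        raw = download(source, workdir)        # source URL / ESRI store
        sample(raw, workdir)                   # sample for user inspection
        extracted = extract(raw, workdir)      # JSON/CSV/SHP -> extracted.csv
        out_csv = conform(source, extracted, workdir)  # -> out.csv
        upload(out_csv, workdir)               # out.csv + job data to S3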

There are times I’ve wanted to execute these subtasks separately, particularly the download step, which is slow and unreliable. To some extent the current data caching strategy in Marks 1 and 2 deals with that, and it may be sufficient. But you could imagine breaking every source into five tasks and running them separately on a job queue. Worth considering.

History

Why “mark 3”?

OpenAddress Machine mark 1 was Mike’s original work: wrapping the Node code in a bunch of Python scripts to do regular runs and post results to a web dashboard. It is awesome.

We’ve been working on mark 2, the “ditch-node” branch, where we rewrote all the Node code in Python. The overall job architecture, the management of downloads and files and such, is about the same as mark 1: one big Python process with jobs run in parallel using multiprocessing. We did make some changes to how tasks are conceived and run.

Updates

Some notes added after this post was written:

  • Ian notes that GitHub hooks plus Jenkins could trigger a build when a new source is added