Migurski and I have been working up a new architecture for OpenAddresses, specifically the Python code that runs source jobs. Currently everything runs as a big batch job every two days. The primary goal is a system that can run a newly submitted source instantly, ideally with useful feedback to the submitter.
Mike’s insight was to build a system a little like Travis; something that looks at GitHub commits with a hook, runs a new job every time something is committed, and posts back info via the GitHub status API. GitHub is nice because it provides a file store and change tracking. Ultimately we don’t want to require a GitHub login to submit sources, and we probably want a custom UI, but that can come later.
Here are the components we’re talking about building:
- A GitHub webhook that submits info to our new job server when stuff is checked in.
- A new job server that accepts HTTP posts of GitHub events. Mike’s got one working in Flask running on Heroku.
- A job queue. New jobs get enqueued by the new job server, then some worker daemon process checks for jobs to run and executes them. Can be a very simple queue, but needs to support job priorities (interactive vs. batch). Right now I’m thinking of writing my own using Postgres running in Amazon RDS, but there’s got to be something I can just use that’s not super overkill. (Update: pq looks good.)
- Workers to run jobs. Something to look for new jobs and execute them. Current thought is a persistent Unix server running on EC2. It installs openaddr and its dependencies via pip, runs jobs via openaddr-process-one, then posts results back to a result queue.
- Job completion notifications back to GitHub using the GitHub status API. Mike has been tinkering with code to do this already.
- A batch-mode refresher. Something to run sources periodically even if no one’s edited the source specification in a while. Mostly to catch new data on the remote server.
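To make the webhook handoff concrete, here’s a sketch of how the job server might pick candidate source files out of a GitHub push-event payload. The `commits`/`added`/`modified`/`after` fields are real parts of GitHub’s push event; the `sources/` directory layout and the `.json` filter are assumptions about the repo, not settled decisions.

```python
import json

def sources_from_push(event_json):
    """Pull candidate source files out of a GitHub push-event payload.

    Assumes sources live under sources/ in the repo and that we care
    about added and modified .json files in each commit.
    """
    event = json.loads(event_json)
    paths = set()
    for commit in event.get("commits", []):
        for path in commit.get("added", []) + commit.get("modified", []):
            if path.startswith("sources/") and path.endswith(".json"):
                paths.add(path)
    # Return the files to run, plus the head commit SHA for status reporting.
    return sorted(paths), event["after"]
```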
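The interactive-vs-batch priority scheme could look something like this. This is only a sketch: sqlite3 stands in for Postgres/RDS to keep it self-contained, the table and column names are made up, and a real multi-worker setup would need row locking (e.g. `SELECT ... FOR UPDATE` in Postgres) that sqlite can’t express.

```python
import json
import sqlite3

# sqlite3 stands in for Postgres here; all names are illustrative.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE jobs (
        id INTEGER PRIMARY KEY,
        priority INTEGER NOT NULL,  -- 0 = interactive, 1 = batch
        payload TEXT NOT NULL,      -- JSON blob describing the source to run
        status TEXT NOT NULL DEFAULT 'pending'
    )""")

def enqueue(payload, priority):
    db.execute("INSERT INTO jobs (priority, payload) VALUES (?, ?)",
               (priority, json.dumps(payload)))
    db.commit()

def dequeue():
    # Interactive jobs (lower priority number) always run before batch jobs.
    row = db.execute("""SELECT id, payload FROM jobs
                        WHERE status = 'pending'
                        ORDER BY priority, id LIMIT 1""").fetchone()
    if row is None:
        return None
    db.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (row[0],))
    db.commit()
    return json.loads(row[1])

enqueue({"source": "us-ca-oakland.json"}, priority=1)  # batch
enqueue({"source": "us-ny-nyc.json"}, priority=0)      # interactive
```

A dedicated library like pq would replace most of this, but the priority ordering is the part that needs checking against whatever we pick.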
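A worker’s inner step might look roughly like this. The exact openaddr-process-one arguments shown are an assumption and should be checked against the real CLI; the point is just that the worker shells out, captures the log, and reports success or failure back.

```python
import subprocess
import tempfile

def run_job(source_path, command=("openaddr-process-one",)):
    """Run one source job and return (success, log_text).

    The openaddr-process-one invocation is a sketch; the real arguments
    should be verified against the openaddr documentation.
    """
    out_dir = tempfile.mkdtemp(prefix="oa-job-")
    argv = list(command) + [source_path, out_dir]
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    log, _ = proc.communicate()
    return proc.returncode == 0, log.decode("utf-8", "replace")
```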
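Posting the result back is a single authenticated POST to GitHub’s status API. This sketch builds the request with stdlib urllib; the `"openaddresses/job"` context string and the description wording are invented for illustration.

```python
import json
import urllib.request

API = "https://api.github.com"

def status_request(owner, repo, sha, state, description, token):
    """Build a GitHub status-API request for a finished job.

    state is one of "pending", "success", "error", "failure";
    the context name below is our assumption, not an API requirement.
    """
    url = "%s/repos/%s/%s/statuses/%s" % (API, owner, repo, sha)
    body = json.dumps({
        "state": state,
        "description": description,
        "context": "openaddresses/job",
    }).encode("utf-8")
    req = urllib.request.Request(url, data=body, method="POST")
    req.add_header("Authorization", "token " + token)
    req.add_header("Content-Type", "application/json")
    return req  # pass to urllib.request.urlopen(req) to actually send it
```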
Here are some open things to design:
- Job status reporting with humane UI. Currently we get very little info back when a job runs: a few bits of status (did it complete, how long it took), a debug log from the process, and any actual output files. We need more. In particular, the stuff you currently learn from the debug log needs to be expanded into useful user-level event data, i.e. events like “source JSON parsed”, “source data downloaded”, etc.
- Security. I’m a bit nervous about running completely unreviewed source specs from anonymous people. Isolating the worker in a throwaway machine might help, as might doing some sanitization on inputs.
- Statistics reporting. I have a dashboard that works OK for the existing nightly batch mode, but we’ll want something different for this continuous mode thing.
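For the user-level event reporting above, something as simple as one line of JSON per named step might be enough to drive both a humane UI and the statistics dashboard. The field names here are illustrative, not a proposed schema.

```python
import json
import time

def event(job_id, name, ok=True, **details):
    """One user-level job event, e.g. "source JSON parsed" or
    "source data downloaded". Field names are illustrative only."""
    return json.dumps({
        "job": job_id,
        "event": name,
        "ok": ok,
        "time": time.time(),
        "details": details,
    })
```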
See also OpenAddresses Mark 3, a proposal I sketched in January. What we’re doing here is similar but without the component of breaking any individual source run up into separate tasks.