Nelson's log

Code in The OpenAddress Machine

I’m trying to help with the OpenAddresses project, particularly Mike Migurski’s “machine” code for running a full analysis of the data. Particularly particularly the ditch-node branch. Managed to get the tests running after some version hassle.

Now I’m taking inventory of what all code is in the repo to understand where to work. These notes are probably only useful right about now (December 2014) and only for the ditch-node branch.

openaddr/*.py

This is the big thing, about 1700 lines of Python code. The architecture is a bit confusing. I don’t really understand how it’s managing threads / processes for parallelism. Also the actual ETL code is a bit tangled up with the download and cache code, including S3.
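For anyone redoing this inventory later, a rough line count like the one above is easy to reproduce. This is only a minimal sketch: the `count_lines` helper is my own name, not something from the repo, and it counts every line, including blanks and comments, so it will overstate the amount of real code.

```python
import glob
import os


def count_lines(pattern):
    """Return {path: line_count} for every regular file matching the glob pattern.

    Hypothetical helper for taking inventory; not part of The Machine.
    Counts all lines, including blank lines and comments.
    """
    counts = {}
    for path in sorted(glob.glob(pattern)):
        if os.path.isfile(path):
            # Read as bytes so odd encodings don't break the count.
            with open(path, "rb") as f:
                counts[path] = sum(1 for _ in f)
    return counts


if __name__ == "__main__":
    # Assumes this is run from the root of a checkout of the repo.
    counts = count_lines("openaddr/*.py")
    for path, n in counts.items():
        print(f"{n:6d}  {path}")
    print(f"{sum(counts.values()):6d}  total")
```

Run from the repo root, it prints one line per file and a total, which is enough to see which modules carry most of the weight.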

openaddr-*

Not part of the Machine repo, these are four small Node projects that contain code for managing OpenAddresses data. This is the code that’s being “ditched” in the branch, and in fact I think it’s no longer used outside the tests.

test.py, tests/

Test suite for The Machine. All the code is currently in test.py, just 11 tests. tests/ holds the test data files.

chef/

Scripts to set up an Ubuntu server to run The Machine. I haven’t actually run these on my dev box, just copied out bits of them as necessary. Some of the things they do: