I’m trying to help with the OpenAddresses project, particularly Mike Migurski’s “machine” code for running a full analysis of the data. Particularly particularly the ditch-node branch. I managed to get the tests running after some version hassle.
Now I’m taking inventory of what all code is in the repo to understand where to work. These notes are probably only useful right about now (December 2014) and only for the ditch-node branch.
This is the big thing, about 1700 lines of Python code. The architecture is a bit confusing. I don’t really understand how it’s managing threads / processes for parallelism. Also the actual ETL code is a bit convolved with the download and cache code, including S3.
- __init__.py: grab bag of top level stuff. Key methods:
  - cache() and conform() are wrappers around Python code to download and process data from sources. This code wraps Python modules and uses multiprocessing.Process to run them as separate jobs (see also jobs.py). The comments say these are Node wrappers, but they no longer are in this branch.
  - excerpt() pulls bits of data out of source files for user presentation. Contains some logic for parsing CSV, Shapefiles, etc. to get a few rows out of them. See also sample.py. excerpt() does not use Process.
- cache.py: Download from HTTP, FTP, and ESRI sources. I believe this code writes entirely to local disk and does not use S3. The ESRI code has logic for doing database-like queries on URLs and rewriting the proprietary ESRI JSON to GeoJSON.
- conform.py: Python code to translate source data to the OpenAddresses CSV output schema. Uses the OGR Python bindings, etc. I believe this is not complete.
- jobs.py: script code to run a bunch of cache jobs, conform jobs, etc. as separate threads. I’m not clear how this multithreading interacts with the multiprocessing code in __init__.py.
- process.py: a main program. This invokes all the work described in jobs.py, then does some processing to extract report-like information (like the map). This is doing I/O via S3 and caches, not so much local disk.
- render.py: Cairo and geo code to draw the map of data sources
- run.py: a wrapper for running stuff on EC2
- sample.py: extract some sample features from a GeoJSON file, used for excerpts?
- summarize.py: produce a report about the job run using an HTML template.
- template/state.html: the actual HTML file that is visible at http://data.openaddresses.io/. Or rather a Jinja2 template.
- template/user-data.sh: not sure what this is. It seems to be a templatized “run” script that installs everything from scratch on a new machine. I imagine this is what Mike runs first on a fresh EC2 instance.
- geodata/*: shape files. I think this is used by render.py to draw pictures of FIPS codes.
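To make the ESRI step in cache.py a bit more concrete, here is a minimal sketch of rewriting ESRI JSON point features as GeoJSON. This is not the code from the repo; the function names and the sample attribute names (ADDR_NUM, STREET) are made up for illustration, it only handles point geometries, and it assumes the coordinates are already longitude/latitude.

```python
import json


def esri_point_to_geojson(esri_feature):
    """Convert one ESRI JSON point feature into a GeoJSON Feature dict.

    ESRI JSON keeps properties under "attributes" and a point geometry
    as {"x": ..., "y": ...}. Anything else is rejected in this sketch.
    """
    geometry = esri_feature.get("geometry") or {}
    if "x" not in geometry or "y" not in geometry:
        raise ValueError("only point geometries are handled in this sketch")
    return {
        "type": "Feature",
        "properties": dict(esri_feature.get("attributes") or {}),
        "geometry": {
            "type": "Point",
            # GeoJSON coordinate order is [longitude, latitude].
            "coordinates": [geometry["x"], geometry["y"]],
        },
    }


def esri_response_to_geojson(esri_response):
    """Wrap all converted features in a GeoJSON FeatureCollection."""
    features = [esri_point_to_geojson(f) for f in esri_response.get("features", [])]
    return {"type": "FeatureCollection", "features": features}


# Hypothetical sample response; the attribute names are invented.
sample = {
    "features": [
        {
            "attributes": {"ADDR_NUM": "123", "STREET": "MAIN ST"},
            "geometry": {"x": -122.27, "y": 37.80},
        }
    ]
}

collection = esri_response_to_geojson(sample)
print(json.dumps(collection, indent=2))
```

Lines and polygons (ESRI "paths" and "rings") need more care and are left out here, which seems fine for address data since it is mostly points.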
Not part of the Machine repo, these are four small Node projects that contain code for managing OpenAddresses data. This is the code that’s being “ditched” in the branch, and in fact I think it’s no longer used outside the tests.
- openaddr-source: this is the openaddresses project itself. The main thing is sources/*.json, a bunch of JSON files that describe data sources and their conforms.
- openaddr-download: code that takes a source URL as input and downloads it. lib/connectors.js has the actual code to do FTP and HTTP requests. Also is able to get data out of ESRI services using esri-dump, another program by Ian Dees that has both Python and Node versions.
- openaddr-cache: another downloading program, this one is focussed on caching source data in S3.
- openaddr-conform: the big program, glues together all the other scripts into a single thing that takes a source JSON as input and emits a processed CSV as output. So it has a little download and cache logic in it as well as a bunch of ETL code for extracting data out of source files and emitting it in the simple CSV schema OpenAddresses publishes. It in turn relies on ogr2ogr and other tools for extraction.
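The "source JSON in, processed CSV out" idea can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the keys inside the source description, the input column names, the output columns (LON, LAT, NUMBER, STREET), and the function names are mine, not taken from openaddr-conform or conform.py. The real code also has to deal with Shapefiles, reprojection, and so on.

```python
import csv
import io

# Hypothetical, simplified source description. The "conform" block maps
# output fields to column names in the source file (all names assumed).
source = {
    "data": "http://example.com/addresses.csv",  # placeholder URL, never fetched
    "conform": {
        "lon": "X",
        "lat": "Y",
        "number": "HOUSE_NUM",
        "street": "STREET_NAME",
    },
}

# Assumed output schema for this sketch.
OUTPUT_COLUMNS = ["LON", "LAT", "NUMBER", "STREET"]


def conform_rows(rows, conform):
    """Yield output rows by looking up each output column in the source row."""
    for row in rows:
        yield {column: row[conform[column.lower()]] for column in OUTPUT_COLUMNS}


def conform_csv(input_text, conform):
    """Read source CSV text and return CSV text in the output schema."""
    reader = csv.DictReader(io.StringIO(input_text))
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=OUTPUT_COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(conform_rows(reader, conform))
    return output.getvalue()


# Stand-in for a downloaded source file.
raw = "X,Y,HOUSE_NUM,STREET_NAME\n-122.27,37.80,123,MAIN ST\n"
result = conform_csv(raw, source["conform"])
print(result)
```

The point of keeping the mapping in data rather than code is that adding a new source only means writing a new JSON file, which matches how sources/*.json is described above.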
Test suite for The Machine. All the code is currently in test.py, just 11 tests. tests/ is test data files.
Scripts to set up an Ubuntu server to run The Machine. I haven’t actually run these on my dev box, just copied out bits of it as necessary. Some of the things it does:
- run.sh: install Ruby and chef, then invoke chef-solo
- role-ubuntu.json: spec for The Machine. Installs Python, Node, and four openaddr-* Node modules.
- **/default.rb: specs for each package. There aren’t many interesting bits here. Node is pinned to version 0.10.33. Python installs Cairo, GDAL, and pip. The openaddr-* modules get installed in /var/opt. Only openaddr-conform is now needed for the tests to run in ditch-node, although I’m not sure about the full machine loop.