Code in The OpenAddresses Machine

I’m trying to help with the OpenAddresses project, particularly Mike Migurski’s “machine” code for running a full analysis of the data, and more particularly the ditch-node branch. I managed to get the tests running after some version hassle.

Now I’m taking inventory of what all code is in the repo to understand where to work. These notes are probably only useful right about now (December 2014) and only for the ditch-node branch.

openaddr/*.py

This is the big thing, about 1700 lines of Python code. The architecture is a bit confusing. I don’t really understand how it’s managing threads / processes for parallelism. Also the actual ETL code is somewhat entangled with the download and cache code, including S3.

  • __init__.py: grab bag of top-level stuff. Key functions:
    • cache() and conform() are wrappers that download and process data from sources. They wrap Python modules and use multiprocessing.Process to run them as separate jobs (see also jobs.py; there’s a sketch of the pattern after this list). The comments say these are Node wrappers, but in this branch they no longer are.
    • excerpt() pulls bits of data out of source files for user presentation. It contains logic for parsing CSV, Shapefiles, etc. to get a few rows out of them; see also sample.py. Unlike the other two, excerpt() does not use Process.
  • cache.py: download from HTTP, FTP, and ESRI sources. I believe this code writes entirely to local disk and does not use S3. The ESRI code has logic for making database-like queries against URLs and rewriting ESRI’s proprietary JSON to GeoJSON.
  • conform.py: Python code to translate source data to the OpenAddresses CSV output schema. Uses the OGR Python bindings, etc. I believe this is not yet complete.
  • jobs.py: script code to run a bunch of cache jobs, conform jobs, etc. as separate threads. I’m not clear how this multithreading interacts with the multiprocessing code in __init__.py.
  • process.py: a main program. This invokes all the work described in jobs.py, then does some processing to extract report-like information (like the map). It does its I/O via S3 and caches, not so much local disk.
  • render.py: Cairo and geo code to draw the map of data sources
  • run.py: a wrapper for running stuff on EC2
  • sample.py: extract some sample features from a GeoJSON file, used for excerpts?
  • summarize.py: produce a report about the job run using an HTML template.
  • template/state.html: the HTML file visible at http://data.openaddresses.io/, or rather the Jinja2 template for it.
  • template/user-data.sh: not sure what this is. It seems to be a templatized “run” script that installs everything from scratch on a new machine. I imagine this is what Mike runs first on a fresh EC2 instance.
  • geodata/*: shapefiles. I think render.py uses these to draw the areas identified by FIPS codes.
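
To make the parallelism a little more concrete, here’s a minimal sketch of the pattern __init__.py seems to use: hand a module-level function to multiprocessing.Process and run it as a child job. The function, source path, and timeout here are illustrative stand-ins, not the machine’s actual API.

# Minimal sketch of running a cache/conform-style job in a child process.
# run_one_source() and the timeout are hypothetical stand-ins.
from multiprocessing import Process, Queue

def run_one_source(source_path, results):
    # The real code would call into the cache or conform module here.
    results.put((source_path, 'ok'))

def run_in_subprocess(source_path, timeout=60):
    results = Queue()
    proc = Process(target=run_one_source, args=(source_path, results))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()          # give up on jobs that hang
        return (source_path, 'timeout')
    return results.get()

if __name__ == '__main__':
    print(run_in_subprocess('sources/example-source.json'))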

openaddr-*

Not part of the Machine repo, these are four small Node projects that contain code for managing OpenAddresses data. This is the code that’s being “ditched” in the branch, and in fact I think it’s no longer used outside the tests.

  • openaddr-source: this is the openaddresses project itself. The main thing is sources/*.json, a bunch of JSON files that describe data sources and their conforms.
  • openaddr-download: code that takes a source URL as input and downloads it. lib/connectors.js has the actual code to do FTP and HTTP requests. It can also get data out of ESRI services using esri-dump, another program by Ian Dees that has both Python and Node versions.
  • openaddr-cache: another downloading program, this one focused on caching source data in S3.
  • openaddr-conform: the big one. It glues all the other scripts together into a single program that takes a source JSON as input and emits a processed CSV as output. So it has a little download and cache logic of its own, plus a bunch of ETL code for extracting data from source files and emitting it in the simple CSV schema OpenAddresses publishes. It in turn relies on ogr2ogr and other tools for extraction. (A sketch of what one of these source JSON files looks like follows this list.)
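
To make the data flow concrete, here’s a rough sketch of the shape of a source JSON and a toy reader for it. The field names and values are from memory and for illustration, not the canonical schema, so treat it as approximate.

# Illustrative only: a made-up source description roughly in the shape of the
# openaddresses sources/*.json files, plus a toy reader.
import json

EXAMPLE_SOURCE = '''
{
    "data": "http://example.gov/addresses.csv",
    "type": "http",
    "conform": {
        "type": "csv",
        "number": "HOUSE_NUM",
        "street": "STREET_NAME"
    }
}
'''

def describe(source_text):
    source = json.loads(source_text)
    conform = source.get("conform", {})
    print("download:", source.get("data"))
    print("format:  ", conform.get("type"))
    print("columns: ", conform.get("number"), "+", conform.get("street"))

if __name__ == '__main__':
    describe(EXAMPLE_SOURCE)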

test.py, tests/

Test suite for The Machine. All the code is currently in test.py, just 11 tests. tests/ holds the test data files.

chef/

Scripts to set up an Ubuntu server to run The Machine. I haven’t actually run these on my dev box, just copied out bits of them as necessary. Some of the things they do:

  • run.sh: install Ruby and Chef, then invoke chef-solo.
  • role-ubuntu.json: spec for The Machine. Installs Python, Node, and four openaddr-* Node modules.
  • **/default.rb: specs for each package. There aren’t many interesting bits here. Node is pinned to version 0.10.33. The Python recipe installs Cairo, GDAL, and pip. The openaddr-* modules get installed in /var/opt. Only openaddr-conform is still needed for the tests to run in ditch-node, although I’m not sure about the full machine loop.

kvm and Ubuntu snappy

I just launched my first Linux virtual machine ever, thanks to Ubuntu Snappy. Ubuntu 14.04 as the host and the stripped-down Snappy as the guest OS. The “locally with kvm” instructions worked fine except I had to add -nographic to the kvm command line to get it to launch headless.

The guest image has 512MB of RAM available, but the host qemu process has 940MB mapped. Wonder what that’s about? Presumably there are ways to configure the resources available to the guest machine.

I like the idea of Snappy, an Ubuntu distribution specially stripped down for VM guests, Docker, etc. I don’t know the market well enough to know if it’s competitive, but I am much more inclined to use a guest OS if it feels like Ubuntu. I’m confused about what Snappy is doing as a package manager: apt-get is not enabled, and dpkg is there with a bunch of stuff installed, but I think you manage the Ubuntu core OS with the snappy tool itself.

The guest reboots in about 11 seconds.

disk speed

2TB USB drives are now under $100. Which is awesome! But boy that’s a lot of disk to write. How long does it take?

Sadly my iMac is USB 2.0 only. In practice that gets about 21 MBytes/second with the disk. At that speed it takes about 27 hours to read or write all 2TB.

My Linux box has USB3 ports, so the data interface is no longer the limiting factor. In practice I’m getting about 110 MBytes/second, at least at the start of the disk. That’s about 5 hours for the whole drive, presumably gated by how fast the bits spin past the read head. Update: it took nearly 7 hours; aggregate speed was about 82 MB/s.
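
The arithmetic is easy to sanity-check. Here’s a quick sketch, assuming 2 TB means 2×10^12 bytes and the sustained rates quoted above:

# Quick sanity check of the transfer times quoted above.
DISK_BYTES = 2e12                     # 2 TB, decimal
rates = [('USB 2.0 on the iMac', 21e6),
         ('USB 3, start of disk', 110e6),
         ('USB 3, aggregate', 82e6)]
for label, bytes_per_second in rates:
    hours = DISK_BYTES / bytes_per_second / 3600
    print('%-22s %5.1f hours' % (label, hours))
# prints roughly 26.5, 5.1, and 6.8 hours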

(Side note: how do you know you have a USB 3 connection? The plastic inside the port is colored blue; that’s the first tip-off. On Linux you can also inspect the interface speed with “lsusb -t”; 5000M means USB 3. Details on Stack Exchange.)

I’m not finding a reliable source for theoretical throughput on a 5400 RPM spinning platter. Numbers I’ve seen are in the realm of 50–150 MBytes/second. I’d forgotten it depends on the density of the disk. The little 2.5″ disks I prefer should be faster than bigger 3.5″ disks, all things being equal.

There are a few 7200 RPM 2.5″ portable drives out there now, but they’re unusual enough that they are expensive and I don’t fully trust them. Naively, at best that’s a 33% speedup for bulk throughput, not a huge difference. Faster seek times though.

All the rivers (Australia)

I got a decent-looking work print of a map of all the Australian rivers, inspired by Chad Ajamian’s map. It’s on Imgur as a JPG; wordpress.com seems to have done a reasonable job with the full PNG, compressing it better than I can (!).

I produced this in TileMill, using data imported to PostGIS with ogr2ogr. I hoped to use the new MapBox Studio but it’s kind of a pain: working with local data sources is awkward and there’s no simple way to export a hi-res image. TileMill still gets the job done, what a champ! QGIS also works and is faster (caching?), but its pixel rendering isn’t as beautiful as TileMill’s.

The map is styled on two parameters. Color is brown or blue depending on whether the stream is perennial; line width varies by UpstrGeoLn, the combined length of all the line segments that contribute to this one. That’s not a bad proxy for stream importance, actually; I’m using it in lieu of Strahler number. I like the results.

Here’s the relevant part of the TileMill style sheet.

#streams2[perennial="Perennial"] {
  line-color: @stream;
  line-width: 0.5;
  [upstrgeoln > 10000] { line-width: 1.0; }
  [upstrgeoln > 100000] { line-width: 1.5; }
  [upstrgeoln > 1000000] { line-width: 2; }
}

[image: australian-rivers work print]

SRID 900914

Here’s a new one on me: I used ogr2ogr to import the Australian river data into PostGIS and it ended up labeled with SRID 900914. WTF? According to this discussion, that’s what ogr2ogr does if the source’s SRS isn’t already defined in the PostGIS spatial_ref_sys table: it adds a new entry with the CRS info gleaned from the source file. And since the largest defined SRID in most PostGIS installations is 900913 (the bogus Google Mercator entry), the next one created is 900914.

The data is actually in EPSG:4283 (GDA94), which PostGIS already knows about.

While I’m here, this Stack Exchange answer has some useful ogr2ogr flags for big imports: -progress and --config PG_USE_COPY YES. According to the docs COPY may now be the default, but I’m not sure it is. Adding it cut my running time to about 20% of what it was without.
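
Putting those pieces together, here’s roughly the kind of import command I mean, wrapped in Python for this write-up. The database name, file name, and table name are made up, and I haven’t verified that -a_srs EPSG:4283 avoids the custom 900914 entry on this particular data, but assigning the known EPSG code up front is the usual way to keep the SRID sane.

# Hypothetical import: database, file, and table names are illustrative.
import subprocess

cmd = [
    'ogr2ogr',
    '--config', 'PG_USE_COPY', 'YES',  # bulk load with COPY instead of INSERTs
    '-progress',                       # show a progress bar
    '-a_srs', 'EPSG:4283',             # tag the output with the SRS the data is in
    '-nln', 'streams2',                # target table name
    '-f', 'PostgreSQL', 'PG:dbname=rivers',
    'AHGFNetworkStream.shp',           # source file
]
subprocess.check_call(cmd)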

python-pip is broken in Ubuntu 14.04

Running pip on my Ubuntu 14.04 box gives the error “ImportError: cannot import name IncompleteRead” when it tries to import the requests module. It turns out this is a known bug going back to at least April. It’s subtle though: a different Ubuntu box with the same versions of python-pip and python-requests is fine. It’s apparently a bug in python-requests; see also Migurski’s workaround.

A quick workaround is to remove Ubuntu’s python-pip and install pip manually.

Classifying Australian rivers

So the Australian river database doesn’t have Strahler numbers for streams, which I want in order to classify streams by importance. I asked the AHGF team about Strahler numbers and got a very nice reply with some ideas on what to do.

One option is to classify on a different measure. AHGFNetworkStream has a couple of attributes that are useful: UpstrGeoLn and UpstrDArea. They are “Combined geodesic length of contributing upstream AHGFNetworkStream features (incl. segment)” and “Combined albers area (m²) of contributing upstream AHGFCatchment features (incl. segment)”, respectively. That is: how long the lines are that feed into this line, or how big the area is that drains into it. It’s worth a look. Of course what I really want is actual water flow, but that’s not readily available.

The other option is to calculate the Strahler number myself. It seems like a simple enough algorithm, as long as the data is clean, i.e. a strict tree. The AHGF folks warned me their data has loops and splits that might cause problems.
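
For a clean tree the calculation really is simple. Here’s a minimal sketch, assuming the network is a dict mapping each segment to its upstream tributaries; the representation and names are mine, not the AHGF schema.

# Strahler order on a strict tree of stream segments.
# `upstream` maps each segment id to the segments that flow into it.
def strahler(segment, upstream, memo=None):
    if memo is None:
        memo = {}
    if segment in memo:
        return memo[segment]
    children = upstream.get(segment, [])
    if not children:
        order = 1                         # headwater segment
    else:
        orders = [strahler(c, upstream, memo) for c in children]
        top = max(orders)
        # two or more tributaries of the maximum order bump it up by one
        order = top + 1 if orders.count(top) >= 2 else top
    memo[segment] = order
    return order

# Tiny example: two headwaters join, then a third stream joins downstream.
upstream = {'outlet': ['mid', 'c'], 'mid': ['a', 'b'], 'a': [], 'b': [], 'c': []}
print(strahler('outlet', upstream))       # 2

A real network is deep enough that an iterative version would be safer than recursion, and the loops and splits the AHGF folks mention would need special handling, but that’s the core of it.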

ESRI tools apparently calculate Strahler, but that’s not useful to me.

GRASS also apparently has some Strahler calculations. (Or Horton, which is similar; “stream order” is the generic term.) More details here. I wonder which is easier: using GRASS or writing my own code? :-P