OpenAddresses run: causes of failures

206 of the 608 sources I tried with the new Python code didn’t work. I went through and hand-classified them by type of failure. (Related: we need better error reporting in the code; I could only do this by reading stderr logfiles and guessing.)

I’d guess about half the failures are sources that simply weren’t online at the time, and maybe only a quarter are bugs in the Python parsing code, but I’m not sure without looking more closely.

Update: I triaged and either solved or filed issues for the following failures:

  • Parser doesn’t handle malformed CSV
  • Parser doesn’t handle GML files
  • Excerpt failed (partial fix; the conform continues)
  • Run worked but no output CSV

=== Problems with the parser ===

Parser doesn't handle malformed CSV
https://github.com/openaddresses/machine/issues/36
  48 files: kr-seoul-*

Parser doesn't handle GML files
https://github.com/openaddresses/machine/issues/38
  15 files: pl-*

Excerpt failed
https://github.com/openaddresses/machine/issues/35
  is
  jp-akita, jp-chiba, jp-fukushima, jp-gunma, jp-ibaraki, jp-iwate, jp-saitama, jp-tochigi, jp-yamagata
  au-queensland
  ca-ab-calgary
  us-mn-dakota

Shapefile parsing problem
  ca-bc-okanagan_similkameen
  10 files: us-nc-? us-nc-10
  us-nc-lincoln
  us-ne-omaha
  us-va-stafford

CSV parsing problem
  be-flanders
  ca-pe
  us-co-mesa
  us-va

JSON parsing problem
  us-al-shelby
  us-ga-muscogee
  us-id-canyon
  us-in-hamilton
  us-mn-pope
  us-mn-wadena
  us-ms-hinds
  us-ms-madison
  us-nm-san_juan
  us-oh-hamilton
  us-pa-philadelphia
  us-tx-denton us-tx-el_paso us-tx-keller us-tx-north_richland_hills
  us-va-alexandria
  us-va-city_of_emporia
  us-va-city_of_petersburg
  us-wi-adams us-wi-crawford us-wi-dodge


=== Mysteries ===

Run worked but no output CSV
  ca-bc-surrey
  us-ak-matanuska_susitna_borough
  us-al-calhoun
  us-mi-kent
  us-mi-ottawa
  us-mn-otter_tail
  us-nv-las_vegas
  us-pa-beaver
  us-sc-berkeley
  us-sc-lexington
  us-sd
  us-va-accomack
  us-va-city_of_norton
  us-va-essex
  us-va-fluvanna
  us-wi-fond_du_lac
  us-wi-vernon
  us-wy-laramie


=== Malformed sources ===

Multiple shapefiles without a file attribute
  ca-bc-kamloops
  ca-bc-kelowna
  us-nc-davie
  us-oh-clinton
  us-wi-calumet


=== Problems getting source data ===

Download failed
  dk
  ca-ab-strathcona-county
  ca-bc-nanaimo
  ca-bc-vernon
  ca-ns-halifax
  ca-sk-regina
  us-ar
  us-ca-alameda_county
  us-ca-amador
  us-ca-san_francisco
  us-co-gunnison
  us-in-st_joseph
  us-mn-metrogis
  us-nc
  us-nc-wake_county
  us-ny-nyc
  us-va-augusta
  us-va-richmond_city
  us-wa-snohmish
  us-wi-superior
  za-nl-ethekwini

ESRI download failed
  us-ct-avon us-ct-haddam us-ct-lyme us-ct-watertown
  us-fl-alachua
  us-ia-linn
  us-ia-polk
  us-in-madison
  us-in-marion_county
  us-ky-oldham
  us-la-acadia
  us-mi-muskegon
  us-mn-polk
  us-mn-yellow_medicine
  us-mo-barry us-mo-columbia us-mo-st_louis_county
  us-mt-park
  us-nv-henderson us-nv-lander us-nv-nye us-nv-washoe_county
  us-tn-memphis
  us-va-roanoke
  us-va-salem
  us-wa-san_juan
  us-wi-iron us-wi-jefferson us-wi-juneau us-wi-lincoln us-wi-oneida us-wi-richland us-wi-sauk

Bad zip file
  nz
  ca-bc-langley
  ca-bc-west_kelowna
  us-co-sanmiguel
  us-fl-collier
  us-ga-glynn
  us-la-st_james
  us-ma
  us-nc-charlotte us-nc-columbus
  us-ri
  us-tx-round_rock
  us-va-city_of_falls_church

First run of Python OpenAddress code against full source set

After a couple of weeks of steady work, the Python code to process OpenAddresses is nearly feature-complete. I did a big run last night over all the sources we have. It went well!

Output statistics

  • 608 input sources: 402 out.csv files, 372 sample.json files
  • 80M out.csv rows (compare to 100M from the Node code)
  • 3 hours, 5 minutes running time
  • < 16G RAM running 8x parallel
  • 11.5G of useful output files
  • 100+G disk used (temp files left on error)
  • 7G of cached source data: 3.5G zip, 3.5G json
  • 150Wh = $0.50 of electricity

Operational behavior

I ran the whole thing with 8x parallelism using GNU Parallel on my Linux box. The box is nothing special: a quad-core i7 with 16G of RAM and a very ordinary 7200RPM hard drive. My home network is 100Mbit for downloads.

I think disk contention was the primary bottleneck; an SSD might make a big difference. So would keeping data in RAM while we process it; I think I overwhelmed the disk cache. FWIW the disk has a benchmarked throughput of about 120 MBytes/second, but having listened to the thing, there was an awful lot of seeking going on. Parallelism isn’t friends with a single spinning drive.

Below are some munin graphs for the day of the run. There are two runs here: an aborted run to the disk sda (which filled), then a successful run from about 04:30 to 07:30 to the disk sdb. I’m not sure the graphs carry a whole lot of information. They do show some serious IO contention: the CPU graph shows as much time spent in IO wait as doing actual work, and disk latencies hit 500ms for a while there, ugh. The memory graph is too messy to read, but I do know the system never needed to swap.

GNU parallel for openaddr-process-one

I’ve been running batches of OpenAddress conversion jobs for bulk testing with a simple shell for loop:

for f in us-ca-*.json; do
  openaddr-process-one "$f" /tmp/oa
done

But it’s better to do this in parallel, particularly since the work is a nice mix of CPU, network, and disk IO. Below is my first ever GNU parallel script to do that.

How much parallelism? My system has 16 gigs of RAM and 4 honest CPUs (it looks like 8 cores because of hyperthreading). I originally tried running “--jobs 200%”, but because of the hyperthreading that results in 16 jobs, which feels like a lot. So I’m keeping it at 8.

With 8 jobs in parallel we process 29 files in about 14 minutes. Running that in serial took 48 minutes. Los Angeles County alone takes 11 minutes and Solano County takes 10, so a total of 14 minutes to run 29 files sounds pretty good to me.

us-ca-los_angeles_county downloads fast; all its time is spent in my Python conform code. There are 3M rows, so it’s no wonder it’s slow, but 11 minutes still seems too slow. That’s a good place to focus some optimization effort.
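I haven’t profiled it yet. A quick first step would be to run one conform under cProfile and see where the time goes; here’s a minimal harness for that (the function it wraps is hypothetical, wire it to whatever openaddr-process-one ends up calling):

import cProfile
import pstats

def profile_call(func, *args, **kwargs):
    # Run func under cProfile and print the 25 hottest spots by cumulative time.
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        return func(*args, **kwargs)
    finally:
        profiler.disable()
        pstats.Stats(profiler).sort_stats('cumulative').print_stats(25)

# Hypothetical usage, pointing at whatever the conform entry point turns out to be:
# profile_call(conform_source, 'us-ca-los_angeles_county.json', '/tmp/oa')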

Solano County takes 10 minutes to download. It’s one of those ESRI webservice things: 391 requests for 500 records at a time, each taking 1 or 2 seconds. I’ve been thinking we ought to have a separate pre-caching process for the ESRI sources that stores the ripped GeoJSON data in S3 somewhere. There’s no way to test for data changes on the ESRI services, so we can’t cache if we go direct; a pre-cache would insulate the rest of the machine from the slowness of the services and from the caching problems.
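To make that idea concrete, here’s a rough sketch of what a pre-caching step could look like: page through an ESRI feature service 500 records at a time, write the result out as GeoJSON, and push it to S3. Treat everything here as an assumption rather than working machine code: the URL, bucket, and key names are made up, it assumes an ArcGIS server new enough to support resultOffset paging and f=geojson, and boto is just one way to do the upload.

import json
import boto
import requests

def precache_esri(query_url, local_path, bucket_name, key_name, page_size=500):
    # Page through the ESRI query endpoint until it stops returning features.
    features, offset = [], 0
    while True:
        resp = requests.get(query_url, params={
            'where': '1=1', 'outFields': '*', 'f': 'geojson',
            'resultOffset': offset, 'resultRecordCount': page_size})
        resp.raise_for_status()
        page = resp.json().get('features', [])
        if not page:
            break
        features.extend(page)
        offset += page_size

    # Write one big GeoJSON file locally...
    with open(local_path, 'w') as f:
        json.dump({'type': 'FeatureCollection', 'features': features}, f)

    # ...and stash it in S3 so later runs can skip the slow ESRI crawl.
    key = boto.connect_s3().get_bucket(bucket_name).new_key(key_name)
    key.set_contents_from_filename(local_path)

# Hypothetical usage:
# precache_esri('http://example.com/arcgis/rest/services/Addresses/MapServer/0/query',
#               '/tmp/oa/cached/us-ca-solano.geojson', 'openaddr-cache', 'us-ca-solano.geojson')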

We wrote the code to be careful about RAM, processing everything as streams. Most jobs only take a polite 70M to run, but us-ca-kern, us-ca-marin_county, and us-ca-solano all mapped 1G+. Those are all ESRI-type services; maybe the code is buffering all the data in RAM? Need to look more closely.

I suspect the main point of contention for running in parallel is disk IO. There’s a lot of file writing and copying; my own conform code basically writes the whole dataset twice. In theory I could get that down to once but it makes the code significantly more complicated.

Running in parallel like this makes me realize there’s still work to do on the logging configuration. I want to be able to say “log only to a file named on the command line, with a configuration specified in a file named here”; my logging config code doesn’t allow that. For now I use the command-line config to log to a file and shove all of stderr to /dev/null. It’d be nicer to ask parallel to drop stderr from the jobs it’s running, but I couldn’t figure out how to do that. Even better would be to make the Python logging configurable the way I want.
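For the record, here’s a sketch of the shape I want the logging setup to have: formats and levels from a config file, output only to a file named on the command line, nothing on stderr. This isn’t how openaddr is wired today; the function and its arguments are my own invention.

import logging
import logging.config

def setup_logging(logfile, configfile=None):
    # Optionally load formatters/levels from a named config file.
    if configfile:
        logging.config.fileConfig(configfile, disable_existing_loggers=False)

    root = logging.getLogger()
    # Drop any stderr StreamHandlers the config (or a library) installed.
    for handler in list(root.handlers):
        if isinstance(handler, logging.StreamHandler) and not isinstance(handler, logging.FileHandler):
            root.removeHandler(handler)

    # Log only to the file named on the command line.
    file_handler = logging.FileHandler(logfile)
    file_handler.setFormatter(logging.Formatter('%(asctime)s %(levelname)s %(name)s: %(message)s'))
    root.addHandler(file_handler)

With something like that in place I could drop the 2> /dev/null hack in the script below.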

#!/bin/bash

# Quickie script to run a bunch of openaddr-process-one jobs in parallel

srcdir=/home/nelson/src/oa/openaddresses-me/sources
out=/tmp/oa
sources='us-ca-*.json'

# Only try to process sources with a conform
sources=$(cd "$srcdir"; grep -l 'conform' $sources)
sources=$(echo "$sources" | sed 's/\.json$//g')
mkdir -p "$out"

# Echo a bunch of commands to run; these could be piped into a shell to run sequentially
#for s in $sources; do
#    echo openaddr-process-one -v -l "$out/$s.log" "$srcdir/$s.json" "$out"
#done

start=`date +%s`
parallel -u --jobs 8 openaddr-process-one -v -l "$out/{}.log" "$srcdir/{}.json" "$out" ::: $sources 2> /dev/null
end=`date +%s`
echo `echo $sources | wc -w` jobs run in $[end-start] seconds

GitHub collaboration notes

After my whining about pull requests, Mike gave me commit access to the repo. Here are some quick notes on some simple git stuff he helped me do. The setup is that we both started working in parallel on separate projects. He made a private branch, made a bunch of changes, then pushed them to the ditch-node branch. Meanwhile I had made my own pyconform branch off of ditch-node before his changes and wanted to bring mine in with his. We could have just merged, but Mike asked me to rebase so that my changes came after his, to keep the history tidy. That was easy because there were no conflicts. Here’s how it went.

# Push my pyconform branch up to GitHub
  506  git push -u origin pyconform

# Fetch Mike's changes in ditch-node
  507  git fetch origin ditch-node

# In my pyconform branch, rebase with the changes in ditch-node.
# This happened automatically without extra work on my part, no conflicts
# No commit necessary.
  510  git rebase origin/ditch-node

# Try to push my newly altered pyconform branch to GitHub
  513  git push
# That didn't work because we've altered history and Git is confused.
# So now we have to force the push to rewrite GitHub's version of things
  514  git push -f origin pyconform

# At this point my pyconform branch is done and looks as if I'd forked it off
# of Mike's changes, even though really we were working in parallel. Now it's
# time to merge my changes back to ditch-node and get rid of the pyconform branch

# Switch to ditch-node
  515  git checkout ditch-node
# Pull in the changes that are on GitHub
  516  git pull
# Merge from my pyconform branch
  517  git merge pyconform
# Push the merged commits back up to GitHub
  519  git push
# Delete my branch locally
  520  git branch -d pyconform
# Delete the branch on GitHub.
  521  git push origin pyconform
# Ooops, that didn't work, so do this magic to remotely delete the branch
  522  git push origin :pyconform

I regret not being more comfortable with this kind of git usage. I am sort of OK with git by myself but it’s a lot more complicated when collaborating.

openaddresses-conform Node code notes (plus bonus Python)

I’m working on porting openaddresses-conform from NodeJS to Python. First time I really sat down and read the code from start to finish. It’s remarkably clean and easy to follow, once you read past all the async-back-bending.

The basic flow of the tool is to take arbitrary source files, convert them to CSV files in a schema similar to the source, then convert that source CSV schema to the out.csv schema of OpenAddresses, the one with LON,LAT,NUMBER,STREET. It works by creating a lot of files on disk to hold intermediate results of transformations.

Here’s the core logic, which starts at processSource(). ConformCLI() is the command line tool entry point.

processSource(): download a source from the Internet, conform it, upload out.csv to S3.

conformCache() is the meat of the extract and transform process. It has comments! (A rough Python sketch of the merge/split/drop steps follows the list.)

  • Convert to UTF8: if source.conform.encoding is set, use iconv to convert source file to a new file with UTF8 encoding.
  • Convert to CSV: turn all types of sources in to a CSV file (see below on convert.js)
  • Merge columns if source.conform.merge is set.
  • Advanced merge columns if source.conform.advanced_merge is set.
  • Split address if source.conform.split is set.
  • Drop columns: prepare a new CSV file with just the core LON,LAT,NUMBER,STREET fields picked out of the processed source CSV.
  • Reproject to EPSG:4326. Only for CSV and GeoJSON files, creates a VRT file pointing at the processed source CSV and uses ogr2ogr to convert it to WGS 84.
  • Expand abbreviations, fix capitalization, and drop null rows. This code is a bit woolly; it has a comment “I haven’t touched it because it’s magic” and there are no tests.
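Here’s that rough Python sketch of the merge / split / drop steps, operating on a single row dict. The conform keys are the ones that show up in source JSON files, but treat this as a sketch of the idea, not the ported code:

def conform_row(row, conform):
    # Start from the source's idea of number and street, if named in the spec.
    number = row.get(conform.get('number'), '')
    street = row.get(conform.get('street'), '')

    # Merge columns, e.g. "merge": ["predir", "name", "suffix"] -> one street string.
    if 'merge' in conform:
        street = ' '.join(row[f] for f in conform['merge'] if row.get(f))

    # Split a combined address, e.g. "split": "SITEADDR" turns "123 MAIN ST"
    # into number "123" and street "MAIN ST".
    if 'split' in conform:
        number, _, street = row[conform['split']].partition(' ')

    # Drop columns: only the out.csv schema survives.
    return {'LON': row.get(conform.get('lon'), ''),
            'LAT': row.get(conform.get('lat'), ''),
            'NUMBER': number,
            'STREET': street}

The real thing also has to handle advanced_merge, encodings, skiplines, and the reprojection step, which is part of why the Node code leans on intermediate CSV files and ogr2ogr.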

convert.js has five separate functions for converting different source types to a CSV file in the source schema.

  • shp2csv(): uses OGR2OGR to convert a shapefile to a CSV file. Also reprojects to EPSG:4326. Quite simple code.
  • polyshp2csv(): very similar to shp2csv(), but handles polygons.
  • json2csv(): uses geojson-stream to convert a GeoJSON file to a CSV file. Figures out headers from the first object’s properties. Converts polygons to points with a very simple average.
  • csv(): uses Node CSV modules to read the source CSV file and transform it into a cleaner CSV for further processing. Implements source.conform.headers and source.conform.skiplines.
  • xml(): uses ogr2ogr to convert an XML input to CSV, with reprojection. Very simple code, may not be used yet?
  • There’s also code for computing the center of a polygon that’s duplicated in polyshp2csv() and json2csv(); the sketch below shows the idea.
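The “very simple average” is essentially this (a sketch; I haven’t checked exactly how the Node code handles rings):

def naive_center(polygon_coords):
    # Average the vertices of the exterior ring (GeoJSON puts it first).
    # Not a true centroid: it's biased wherever vertices cluster, but it's
    # cheap and usually good enough for an address point.
    ring = polygon_coords[0]
    lons = [lon for lon, lat in ring]
    lats = [lat for lon, lat in ring]
    return sum(lons) / len(lons), sum(lats) / len(lats)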

openaddresses-conform has a small test suite. I believe it works on full source files; it’s more like functional tests than unit tests, running big chunks like conform or download and verifying the output is reasonable.

Existing Python port

Machine has its own Python code for doing the equivalent of downloadCache() and updateCache(). It’s quite different from the Node code strategy and is extra complicated because of the way Machine was a wrapper for invoking the Node code.

Very little of the extract and transform logic of conformCache() has been ported to Python, but @iandees did some back in November. That code lives mostly in conform.py:ConvertToCsvTask. It works differently from the Node code: it takes whatever input file has been downloaded and uses the Python OGR bindings to convert it to a CSV file in the schema of the source, plus an extra computed centroid column. There’s no code yet to extract attributes or merge columns or anything. It’s streaming code; it never loads the whole thing into RAM.

I’m still weighing this choice to lean so heavily on OGR to process source files; it’s different from how the Node code is designed. I asked Ian, and he said at the time all sources could be processed with ogr2ogr, so there was no need to write extra code for source CSV parsing, etc. The steps the existing code takes (a minimal sketch in code follows the list):

  • Check the type is .shp, .json, .csv, or .kml
  • Open source with OGR and figure out what field names are in it.
  • Set up OGR to transform the source SRS to EPSG:4326
  • Create a CSV file using csv.DictWriter with headers from the field names.
  • Iterate over features in the OGR layer
  • For each feature, write a row to a CSV file with columns set to all fields from the feature. Also compute a centroid column.
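And here’s that minimal sketch of the flow as I understand it, using the OGR bindings directly. It’s not the actual ConvertToCsvTask code; the function name and the centroid format are placeholders:

import csv
from osgeo import ogr, osr

def ogr_source_to_csv(source_path, dest_path):
    # Open the source (.shp, .json, .csv, .kml) and list its field names.
    datasource = ogr.Open(source_path)
    layer = datasource.GetLayer(0)
    defn = layer.GetLayerDefn()
    field_names = [defn.GetFieldDefn(i).GetName() for i in range(defn.GetFieldCount())]

    # Transform from the source SRS to EPSG:4326.
    dest_srs = osr.SpatialReference()
    dest_srs.ImportFromEPSG(4326)
    transform = osr.CoordinateTransformation(layer.GetSpatialRef(), dest_srs)

    with open(dest_path, 'w') as f:
        writer = csv.DictWriter(f, field_names + ['centroid'])
        writer.writeheader()
        # Stream feature by feature; nothing is held in RAM.
        for feature in layer:
            row = {name: feature.GetField(name) for name in field_names}
            geometry = feature.GetGeometryRef()
            geometry.Transform(transform)
            row['centroid'] = geometry.Centroid().ExportToWkt()
            writer.writerow(row)

    # Per the GDAL gotchas: no Destroy() calls, but do dereference the source.
    datasource = None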

I’ve never used the Python OGR bindings before. It’s pretty un-Pythonic and now I understand why Fiona exists. This list of Python Gotchas in GDAL will be a useful reference going forward. Apparently all the calls to Destroy() are unnecessary, but you do need to take care to manually dereference data sources.

Code in The OpenAddress Machine

I’m trying to help with the OpenAddress project, particularly Mike Migurski’s “machine” code for running a full analysis of the data. Particularly particularly the ditch-node branch. Managed to get the tests running after some version hassle.

Now I’m taking inventory of what all code is in the repo to understand where to work. These notes are probably only useful right about now (December 2014) and only for the ditch-node branch.

openaddr/*.py

This is the big thing, about 1700 lines of Python code. The architecture is a bit confusing. I don’t really understand how it’s managing threads / processes for parallelism. Also the actual ETL code is a bit intertwined with the download and cache code, including S3.

  • __init__.py: grab bag of top-level stuff. Key methods:
    • cache() and conform() are wrappers around Python code to download and process data from sources. This code wraps Python modules and uses multiprocessing.Process to run them as separate jobs (see also jobs.py). The comments say these are Node wrappers, but they no longer are in this branch.
    • excerpt() pulls bits of data out of source files for user presentation. Contains some logic for parsing CSV, Shapefiles, etc to get a few rows out of them. See also sample.py. Excerpt does not use Process.
  • cache.py: download from HTTP, FTP, and ESRI sources. I believe this code writes entirely to local disk and does not use S3. The ESRI code has logic for doing database-like queries on URLs and for rewriting the ESRI proprietary JSON to GeoJSON.
  • conform.py: Python code to translate source data to the OpenAddress CSV output schema. Uses the OGR Python bindings, etc. I believe this is not complete.
  • jobs.py: script code to run a bunch of cache jobs, conform jobs, etc. as separate threads. I’m not clear how this multithreading interacts with the multiprocessing code in __init__.py.
  • process.py: a main program. This invokes all the work described in jobs.py, then does some processing to extract report-like information (like the map). It does I/O via S3 and caches, not so much local disk.
  • render.py: Cairo and geo code to draw the map of data sources
  • run.py: a wrapper for running stuff on EC2
  • sample.py: extract some sample features from a GeoJSON file, used for excerpts?
  • summarize.py: produce a report about the job run using an HTML template.
  • template/state.html: the actual HTML file that is visible at http://data.openaddresses.io/. Or rather a Jinja2 template.
  • template/user-data.sh: not sure what this is. It seems to be a templatized “run” script that installs everything from scratch on a new machine. I imagine this is what Mike runs first on a fresh EC2 instance.
  • geodata/*: shape files. I think this is used by render.py to draw pictures of FIPS codes.

openaddr-*

Not part of the Machine repo, these are four small Node projects that contain code for managing OpenAddresses data. This is the code that’s being “ditched” in the branch, and in fact I think it’s no longer used outside the tests.

  • openaddr-source: this is the openaddresses project itself. The main thing is sources/*.json, a bunch of JSON files that describe data sources and their conforms.
  • openaddr-download: code that takes a source URL as input and downloads it. lib/connectors.js has the actual code to do FTP and HTTP requests. Also is able to get data out of ESRI services using esri-dump, another program by Ian Dees that has both Python and Node versions.
  • openaddr-cache: another downloading program, this one is focussed on caching source data in S3.
  • openaddr-conform: the big program, glues together all the other scripts into a single thing that takes a source JSON as input and emits a processed CSV as output. So it has a little download and cache logic in it, as well as a bunch of ETL code for extracting data out of source files and emitting it in the simple CSV schema OpenAddresses publishes. It in turn relies on ogr2ogr and other tools for extraction.

test.py, tests/

Test suite for The Machine. All the code is currently in test.py, just 11 tests; tests/ holds the test data files.

chef/

Scripts to set up an Ubuntu server to run The Machine. I haven’t actually run these on my dev box, just copied out bits as necessary. Some of the things they do:

  • run.sh: installs Ruby and Chef, then invokes chef-solo
  • role-ubuntu.json: spec for The Machine. Installs Python, Node, and the four openaddr-* Node modules.
  • **/default.rb: specs for each package. There aren’t many interesting bits here. Node is pinned to version 0.10.33. The Python recipe installs Cairo, GDAL, and pip. The openaddr-* modules get installed in /var/opt. Only openaddr-conform is now needed for the tests to run in ditch-node, although I’m not sure about the full machine loop.