openaddresses-conform Node code notes (plus bonus Python)

I’m working on porting openaddresses-conform from NodeJS to Python. First time I really sat down and read the code from start to finish. It’s remarkably clean and easy to follow, once you read past all the async-back-bending.

The basic flow of the tool is to take arbitrary source files, convert them to CSV files in a schema similar to the source's, then convert that source CSV schema to OpenAddresses' out.csv schema, the one with LON,LAT,NUMBER,STREET. It works by creating a lot of files on disk to hold the intermediate results of each transformation.
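For reference, here's a hypothetical fragment of out.csv in that target schema (the values are invented for illustration):

```
LON,LAT,NUMBER,STREET
-122.41942,37.77493,123,MAIN ST
-122.41811,37.77560,4510,MISSION BLVD
```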

Here’s the core logic, which starts at processSource(). ConformCLI() is the command-line tool's entry point.

processSource(): download a source from the Internet, conform it, upload out.csv to S3.

conformCache() is the meat of the extract and transform process. It has comments!

  • Convert to UTF-8: if source.conform.encoding is set, use iconv to convert the source file to a new file with UTF-8 encoding.
  • Convert to CSV: turn all types of sources into a CSV file (see below on convert.js).
  • Merge columns if source.conform.merge is set.
  • Advanced merge columns if source.conform.advanced_merge is set.
  • Split address if source.conform.split is set.
  • Drop columns: prepare a new CSV file with just the core LON,LAT,NUMBER,STREET fields picked out of the processed source CSV.
  • Reproject to EPSG:4326: only for CSV and GeoJSON files. Creates a VRT file pointing at the processed source CSV and uses ogr2ogr to convert it to WGS 84 (see the sketch after this list).
  • Expand abbreviations, fix capitalization, and drop null rows. This code is a bit woolly; it carries the comment “I haven’t touched it because it’s magic” and has no tests.
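To make the reprojection step concrete, here's a rough Python sketch of the VRT + ogr2ogr approach. The template, the X/Y column names, and the ogr2ogr flags are my guesses at the general technique, not a literal port of the Node code:

```python
import os
import subprocess
import tempfile

# Wrap the processed source CSV in a VRT so ogr2ogr can interpret its
# X/Y columns as point geometry in the source SRS.
VRT_TEMPLATE = """<OGRVRTDataSource>
  <OGRVRTLayer name="{name}">
    <SrcDataSource>{csv_path}</SrcDataSource>
    <GeometryType>wkbPoint</GeometryType>
    <LayerSRS>{source_srs}</LayerSRS>
    <GeometryField encoding="PointFromColumns" x="X" y="Y"/>
  </OGRVRTLayer>
</OGRVRTDataSource>
"""

def reproject_to_wgs84(csv_path, source_srs, out_path):
    # The VRT layer name should match the CSV's implied layer name,
    # which is its basename without the extension.
    name = os.path.splitext(os.path.basename(csv_path))[0]
    with tempfile.NamedTemporaryFile('w', suffix='.vrt', delete=False) as f:
        f.write(VRT_TEMPLATE.format(name=name, csv_path=csv_path,
                                    source_srs=source_srs))
        vrt_path = f.name
    # -t_srs reprojects to WGS 84; GEOMETRY=AS_XY writes the transformed
    # coordinates back out as X,Y columns in the result CSV.
    subprocess.check_call(['ogr2ogr', '-f', 'CSV', '-t_srs', 'EPSG:4326',
                           '-lco', 'GEOMETRY=AS_XY', out_path, vrt_path])

# e.g. reproject_to_wgs84('source.csv', 'EPSG:26910', 'wgs84.csv')
```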

convert.js has five separate functions for converting different source types to a CSV file in the source schema.

  • shp2csv(): uses ogr2ogr to convert a shapefile to a CSV file. Also reprojects to EPSG:4326. Quite simple code.
  • polyshp2csv(): very similar to shp2csv(), but handles polygons.
  • json2csv(): uses geojson-stream to convert a GeoJSON file to a CSV file. Figures out headers from the first object’s properties. Converts polygons to points with a very simple average.
  • csv(): uses Node CSV modules to read the source CSV file and transform it into a cleaner CSV for further processing. Implements source.conform.headers and source.conform.skiplines.
  • xml(): uses ogr2ogr to convert an XML input to CSV, with reprojection. Very simple code, may not be used yet?
  • There’s code for computing the center of a polygon that’s duplicated in polyshp2csv() and json2csv(); a sketch of that averaging follows this list.
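Since that center-of-polygon code comes up twice, here's the "very simple average" as I understand it, sketched in Python (the helper name is mine):

```python
def naive_center(rings):
    """Average the vertices of a polygon's outer ring to get a point.

    Not a true centroid: densely sampled edges pull the average toward
    them, but it matches the simple averaging described above.
    """
    outer = rings[0]  # GeoJSON-style Polygon: first ring is the outer boundary
    xs = [x for x, y in outer]
    ys = [y for x, y in outer]
    return sum(xs) / len(xs), sum(ys) / len(ys)
```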

openaddresses-conform has a small test suite. I believe it works on full source files; it’s more like functional tests than unit tests, running big chunks like conform or download and verifying that the output is reasonable.

Existing Python port

Machine has its own Python code for doing the equivalent of downloadCache() and updateCache(). It’s quite different from the Node code’s strategy, and it’s extra complicated because Machine was built as a wrapper for invoking the Node code.

Very little of the extract and transform logic of conformCache() has been ported to Python, but @iandees did some of it back in November. That code lives mostly in conform.py:ConvertToCsvTask. It works differently from the Node code: it takes whatever input file has been downloaded and uses the Python OGR bindings to convert it to a CSV file in the schema of the source, plus an extra computed centroid column. There’s no code yet to extract attributes or merge columns or anything. It’s streaming code; it never loads the whole file into RAM.

I’m weighing this choice to lean so heavily on OGR to process source files; it’s different from how the Node code is designed. I asked Ian, and he said that at the time all sources could be processed with ogr2ogr, so there was no need to write extra code for source CSV parsing, etc. Roughly, ConvertToCsvTask does the following (sketched in code after this list):

  • Check that the file type is .shp, .json, .csv, or .kml.
  • Open the source with OGR and figure out what field names are in it.
  • Set up OGR to transform the source SRS to EPSG:4326.
  • Create a CSV file using csv.DictWriter, with headers from the field names.
  • Iterate over the features in the open OGR layer.
  • For each feature, write a row to the CSV file with columns set to all fields from the feature. Also compute a centroid column.
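Here's a minimal sketch of that flow with the Python OGR bindings, as I understand ConvertToCsvTask. The function name, paths, and centroid column format are hypothetical:

```python
import csv
from osgeo import ogr, osr

def convert_to_csv(source_path, out_path):
    ds = ogr.Open(source_path)
    layer = ds.GetLayer(0)

    # Pull the source's own field names for the CSV header
    defn = layer.GetLayerDefn()
    fields = [defn.GetFieldDefn(i).GetName() for i in range(defn.GetFieldCount())]

    # Transform from the source SRS to EPSG:4326
    wgs84 = osr.SpatialReference()
    wgs84.ImportFromEPSG(4326)
    transform = osr.CoordinateTransformation(layer.GetSpatialRef(), wgs84)

    with open(out_path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fields + ['centroid'])
        writer.writeheader()
        # Streaming: one feature at a time, never the whole file in RAM
        for feature in layer:
            row = {name: feature.GetField(name) for name in fields}
            geom = feature.GetGeometryRef()
            geom.Transform(transform)
            centroid = geom.Centroid()
            row['centroid'] = 'POINT ({} {})'.format(centroid.GetX(), centroid.GetY())
            writer.writerow(row)

    # No Destroy() calls needed, but dereference so GDAL closes the file
    layer = None
    ds = None
```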

I’ve never used the Python OGR bindings before. They’re pretty un-Pythonic, and now I understand why Fiona exists. This list of Python Gotchas in GDAL will be a useful reference going forward. Apparently all the calls to Destroy() are unnecessary, but you do need to take care to manually dereference data sources.
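To make that last gotcha concrete, here's the pattern as I understand it from the gotchas list (untested):

```python
from osgeo import ogr

ds = ogr.Open('source.shp')
layer = ds.GetLayer(0)
print(layer.GetFeatureCount())

# No ds.Destroy() call needed. But GDAL only flushes and closes the
# underlying file once the data source is dereferenced:
ds = None
# ...and the layer must not be used past this point, because it does
# not keep its parent data source alive.
```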