Nelson's log

openaddresses-conform Node code notes (plus bonus Python)

I’m working on porting openaddresses-conform from Node to Python. It’s the first time I’ve really sat down and read the code from start to finish. It’s remarkably clean and easy to follow, once you read past all the async back-bending.

The basic flow of the tool is to take an arbitrary source file, convert it to a CSV file in a schema similar to the source’s, then convert that source-schema CSV to the out.csv schema of OpenAddresses, the one with LON,LAT,NUMBER,STREET. It works by creating a lot of files on disk to hold intermediate results of transformations.
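A sketch of that second step, the source-schema CSV to out.csv conversion. The field names in the mapping are hypothetical; real sources declare their own column names in their conform spec.

```python
import csv

# Hypothetical mapping from one source's columns to the out.csv schema.
# In practice this mapping comes from the source's conform spec.
FIELD_MAP = {"lon": "LON", "lat": "LAT", "housenum": "NUMBER", "st_name": "STREET"}

def convert_to_out_schema(source_csv, out_csv):
    """Copy rows from a source-schema CSV into the LON,LAT,NUMBER,STREET schema."""
    reader = csv.DictReader(source_csv)
    writer = csv.DictWriter(out_csv, fieldnames=["LON", "LAT", "NUMBER", "STREET"])
    writer.writeheader()
    for row in reader:
        writer.writerow({out: row[src] for src, out in FIELD_MAP.items()})
```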

Here’s the core logic, which starts at processSource(). ConformCLI() is the command line tool entry point.

processSource(): download a source from the Internet, conform it, upload out.csv to S3.
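In Python terms, that top-level flow might look something like this. All the function names and signatures here are my guesses at a port, not the actual Node API; the helpers are stubs standing in for the real work.

```python
import os

def download(source_url, workdir):
    # Stub: fetch the source from the Internet; return the local path.
    return os.path.join(workdir, "raw")

def conform(raw_path, workdir):
    # Stub: extract and transform the raw source into out.csv.
    return os.path.join(workdir, "out.csv")

def upload_to_s3(path):
    # Stub: publish out.csv to S3.
    pass

def process_source(source_url, workdir):
    """Mirror processSource(): download a source, conform it, upload out.csv."""
    raw_path = download(source_url, workdir)
    out_csv = conform(raw_path, workdir)
    upload_to_s3(out_csv)
    return out_csv
```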

conformCache() is the meat of the extract and transform process. It has comments!

convert.js has five separate functions for converting different source types to a CSV file in the source schema.
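In a Python port, that per-source-type conversion could be a simple dispatch on file type. The converter names below are invented for illustration, not the actual convert.js function names, and the bodies are left as stubs.

```python
import os

# Hypothetical converters, one per source type, each of which would
# write a CSV file in the source's own schema.
def convert_shapefile(path, out_path):
    raise NotImplementedError

def convert_geojson(path, out_path):
    raise NotImplementedError

def convert_csv(path, out_path):
    # Source is already CSV; would normalize headers and encoding.
    raise NotImplementedError

CONVERTERS = {
    ".shp": convert_shapefile,
    ".json": convert_geojson,
    ".geojson": convert_geojson,
    ".csv": convert_csv,
}

def convert_source(path, out_path):
    """Pick a converter based on the source file's extension."""
    ext = os.path.splitext(path)[1].lower()
    try:
        return CONVERTERS[ext](path, out_path)
    except KeyError:
        raise ValueError("no converter for %r" % ext)
```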

openaddresses-conform has a small test suite. I believe it works on full source files; it’s more like functional tests than unit tests, running big chunks like conform or download and verifying the output is reasonable.

Existing Python port

Machine has its own Python code for doing the equivalent of downloadCache() and updateCache(). It’s quite different from the Node code strategy and is extra complicated because of the way Machine was a wrapper for invoking the Node code.

Very little of the extract and transform logic of conformCache() has been ported to Python, but @iandees did some back in November. That code lives mostly in conform.py:ConvertToCsvTask. It works differently from the Node code. It basically takes whatever input file has been downloaded and uses the Python OGR bindings to convert it to a CSV file in the schema of the source, plus an extra computed centroid. There’s no code yet to extract attributes or merge columns or anything. It’s streaming code; it never loads the whole file into RAM.

I’m considering this choice to lean so heavily on OGR to process source files; it’s different from how the Node code is designed. I asked Ian, and he said that at the time all sources could be processed with ogr2ogr, so there was no need to write extra code for source CSV parsing, etc.

I’ve never used the Python OGR bindings before. They’re pretty un-Pythonic, and now I understand why Fiona exists. This list of Python Gotchas in GDAL will be a useful reference going forward. Apparently all the calls to Destroy() are unnecessary, but you do need to take care to manually dereference data sources.