Census of US county parcel records

Fairview Industries has a great map of which US counties publish GIS parcel data. Mike got the shapefile for the database. I went ahead and extracted just the URLs and county names to a convenient CSV file.

The file lists a total of 3214 counties. 2022 have ViewURLs defined and 590 have DownURLs defined.

Here’s my extractor code.

#!/usr/bin/env python3

import fiona, sys, csv

out = csv.writer(open('extracted.csv', 'w', encoding='utf-8', newline=''))

out.writerow(('FIPS', 'State', 'County', 'ViewURL', 'DownURL'))

count = 0
gisInvLks = 0
viewUrls = 0
downUrls = 0
with fiona.drivers():
    with fiona.open('counties.shp') as counties:
        for county in counties:
            props = county['properties']
            # Fix mojibake; Fiona read these strings as ISO-Latin-1 but they are actually UTF-8
            cntyNm = props['CntyNm'].encode('latin_1')
            cntyString = cntyNm.decode('utf-8')
            out.writerow((props['StCntyFIPS'], props['StNm'], cntyString, props['ViewURL'], props['DownURL']))
            count += 1
            if props['ViewURL']:
                viewUrls += 1
            if props['DownURL']:
                downUrls += 1

sys.stderr.write('%d rows\n%d ViewURLs\n%d DownURLs\n' % (count, viewUrls, downUrls))
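
Run against the Fairview shapefile, this prints the summary quoted above:

3214 rows
2022 ViewURLs
590 DownURLs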

Python CSV benchmarks

I tested various ways of reading a CSV file in Python, from simply reading the file line by line to using the full unicodecsv DictReader. Here's what I discovered. Test data is dk.csv, a 3.4M row CSV file with 46 columns. (See also: file reading benchmarks.)

  • The Python2 csv module takes 2x longer than a naive split(',') on every line
  • Python2 DictReader takes 2-3x longer than the simple csv reader that returns tuples
  • Python2 unicodecsv takes 5.5x longer than csv
  • Python3 csv takes 2-3x longer than Python2 csv. However, it is Unicode-correct
  • Pandas in Python2 is about the same speed as DictReader, but is Unicode-correct.

I'm not sure why unicodecsv is so slow. I took a quick look in cProfile, and all the time is being spent in next(), where you'd expect. All those isinstance tests add significant time (20% or so), but that's not the majority of the 5.5x slowdown. I guess string decoding is just a lot of overhead in Python2? It's not trivial in Python3 either; I hadn't realized how much slower string IO is in Py3. I wonder if there's more going on. Anyway, I filed an issue on unicodecsv asking about performance.

I’ve never used Pandas before. I ran into someone else saying unicodecsv is slow who switched to Pandas. It sure is fast! I think it’s a lot of optimized C code. But Pandas is a big package and has its own model of data and I don’t know that I want to buy into all of that. Its CSV module is nicely feature-rich though.

Not sure what conclusion to draw for OpenAddresses. I think we spend ~50% of our time just parsing CSV (for a CSV source like dk or nl). Switching from DictReader to the regular reader() is the least pain; concretely, for a 60 minute job that'd bring the time down to about 40–45 minutes. A nice improvement, but not life-altering. Switching to Python3, so that we no longer need unicodecsv, would save about the same amount of time.
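
For concreteness, here's a minimal sketch of the swap (the 'street' column name is hypothetical; dk.csv's real columns differ): read the header once and index into plain tuples instead of paying for a dict per row.

import csv

# DictReader: convenient, but builds a dict for every row
for row in csv.DictReader(open('dk.csv')):
    street = row['street']

# Plain reader: grab the header once, then index into tuples
reader = csv.reader(open('dk.csv'))
header = next(reader)
street_ix = header.index('street')  # hypothetical column name
for row in reader:
    street = row[street_ix]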

Python 2 results

  0.17s catDevNull
  0.25s wc
  1.91s pythonRead
  0.58s pythonReadLine
  6.62s dumbCsv
 10.97s csvReader
 29.63s csvDictReader
 62.82s unicodeCsvReader
120.93s unicodeCsvDictReader
 27.93s pandasCsv

Python 3 results

 0.17s catDevNull
 0.25s wc
 5.18s pythonRead
 3.50s pythonReadLine
11.83s dumbCsv
27.77s csvReader
51.37s csvDictReader

The Code

# Benchmark various ways of reading a csv file
# Code works in Python2 and Python3

import sys, time, os, csv

try:
    import unicodecsv
except ImportError:
    unicodecsv = None
try:
    import pandas
except ImportError:
    pandas = None

fn = sys.argv[1]

def warmCache():
    os.system('cat %s > /dev/null' % fn)

def catDevNull():
    os.system('cat %s > /dev/null' % fn)

def wc():
    os.system('wc -l %s > /dev/null' % fn)

def pythonRead():
    fp = open(fn)
    fp.read()

def pythonReadLine():
    fp = open(fn)
    for l in fp:
        pass

def csvReader():
    reader = csv.reader(open(fn, 'r'))
    for l in reader:
        pass

def unicodeCsvReader():
    reader = unicodecsv.reader(open(fn, 'r'))
    for l in reader:
        pass

def csvDictReader():
    reader = csv.DictReader(open(fn, 'r'))
    for l in reader:
        pass

def unicodeCsvDictReader():
    reader = unicodecsv.DictReader(open(fn, 'r'))
    for l in reader:
        pass

def dumbCsv():
    'Really simplistic CSV style parsing'
    fp = open(fn, 'r')
    for l in fp:
        d = l.split(',')

def pandasCsv():
    d = pandas.read_csv(fn, encoding='utf-8')
    # Ensure pandas really read the whole thing
    d.tail(1)

def run(f):
    "Run a function, return time to execute it."
    # timeit is complex overkill
    s = time.time()
    f()
    e = time.time()
    return e-s


functions = [catDevNull, wc, pythonRead, pythonReadLine, dumbCsv, csvReader, csvDictReader]
if unicodecsv:
    functions.extend([unicodeCsvReader, unicodeCsvDictReader])
if pandas:
    functions.append(pandasCsv)

warmCache()

for f in functions:
    t = run(f)
    print('%.2fs %s' % (t, f.__name__))

OpenAddresses optimization: some baseline timings

Some rough notes for optimizing openaddresses conform. These times are from a January 23 run of the Python code. The machine was busy running 8x jobs, so times may be somewhat inflated over their true values, but it's a start.

Here are 3 sources that I know to be slow because of our own code. The times reported here are purely time spent in conform, after the source was downloaded.

  • nl (csv source): 35 minutes, 14.8M rows
  • dk (csv source): 43 minutes, 3.4M rows
  • au-victoria (shapefile source): 46 minutes, 3.4M rows
  • ??? ESRI source. No good examples; most of my conform code treats ESRI sources as effectively CSV anyway, so I'm going to ignore them for now.

I just ran nl again and it took 31.5 minutes (26 minutes user, 5 minutes sys). Close enough; I'll take the January 23 times as still indicative, at least for CSV sources.

Here are some fast, short sources I can use for testing. These are total times, including network.

  • us-ca-palo_alto.json (csv source) 26 seconds
  • ca-bc-north_cowichan.json (csv source) 24 seconds
  • us-wa-chelan.json (shapefile source) 33 seconds

And here’s a report of top slow jobs that didn’t actually time out. Some of this slowness is due to network download time.

3521s us-va-city_of_chesapeake.json
2807s au-victoria.json
2765s us-ca-marin_county.json
2589s dk.json
2116s nl.json
2032s us-sc-aiken.json
1660s us-va-new_kent.json
1639s es-25830.json
1541s us-nc-alexander.json
1498s us-va.json
1367s us-va-fairfax.json
1352s us-sd.json
1345s us-ca-los_angeles_county.json
1325s us-mn-ramsey.json
1216s us-al-calhoun.json
1015s us-mi-kent.json
973s us-ms-hinds.json
955s us-wa-skagit.json
937s us-tn-rutherford.json
918s us-ca-solano_county.json
918s us-nc.json
786s us-fl-palm_beach_county.json
783s us-wa-seattle.json
776s us-wa-king.json
769s be-flanders.json
762s us-sc-laurens.json
729s us-wy-natrona.json
691s us-il-mchenry.json
682s us-tx-houston.json
678s us-al-montgomery.json
656s pl.json

grep Finished *.log | sort -nr -k 9 | cut -c 65- | sed 's%for /var/opt/openaddresses/sources/%%' | head -50

Here's some quick cProfile output, sorted by cumulative time.


python -m openaddr.conform ~/src/oa/profile/sources-fast/ca-bc-north_cowichan.json ~/src/oa/profile/caches/cowichan.csv /tmp/o/foo
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    6.053    6.053 <string>:1(<module>)
        1    0.000    0.000    6.053    6.053 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:723(main)
        1    0.000    0.000    6.049    6.049 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:700(conform_cli)
        1    0.000    0.000    3.628    3.628 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:652(extract_to_source_csv)
        1    0.042    0.042    3.628    3.628 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:433(csv_source_to_csv)
    30260    0.170    0.000    3.241    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:187(next)
    30260    0.163    0.000    2.958    0.000 /usr/lib/python2.7/csv.py:104(next)
    30262    2.639    0.000    2.715    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:105(next)
        1    0.042    0.042    2.421    2.421 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:678(transform_to_out_csv)
    15129    0.066    0.000    1.328    0.000 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:510(row_extract_and_reproject)
559802/15130    0.636    0.000    1.256    0.000 /usr/lib/python2.7/copy.py:145(deepcopy)
15132/15130    0.147    0.000    1.219    0.000 /usr/lib/python2.7/copy.py:253(_deepcopy_dict)
    30260    0.028    0.000    0.822    0.000 /usr/lib/python2.7/csv.py:151(writerow)
    30260    0.033    0.000    0.599    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:82(writerow)
    15129    0.026    0.000    0.571    0.000 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:558(row_transform_and_convert)
    30262    0.085    0.000    0.402    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:46(_stringify_list)
   332882    0.147    0.000    0.312    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:35(_stringify)
   509313    0.236    0.000    0.309    0.000 /usr/lib/python2.7/copy.py:267(_keep_alive)
    15129    0.025    0.000    0.256    0.000 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:590(row_smash_case)
    30260    0.170    0.000    0.196    0.000 /usr/lib/python2.7/csv.py:143(_dict_to_list)


python -m openaddr.conform ~/src/oa/profile/sources-fast/us-wa-chelan.json ~/src/oa/profile/caches/chelan/*shp /tmp/o/foo

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   29.549   29.549 <string>:1(<module>)
        1    0.000    0.000   29.549   29.549 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:723(main)
        1    0.000    0.000   29.545   29.545 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:700(conform_cli)
        1    0.000    0.000   19.640   19.640 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:652(extract_to_source_csv)
        1    3.952    3.952   19.640   19.640 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:343(ogr_source_to_csv)
        1    0.163    0.163    9.903    9.903 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:678(transform_to_out_csv)
    44111    0.367    0.000    6.508    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:187(next)
  1764400    2.433    0.000    6.275    0.000 /usr/lib/python2.7/dist-packages/osgeo/ogr.py:3012(GetField)
    44111    0.324    0.000    5.837    0.000 /usr/lib/python2.7/csv.py:104(next)
    44112    5.138    0.000    5.370    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:105(next)
    88222    0.095    0.000    3.969    0.000 /usr/lib/python2.7/csv.py:151(writerow)
    88222    0.120    0.000    2.759    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:82(writerow)
    44110    0.078    0.000    2.512    0.000 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:558(row_transform_and_convert)
    88224    0.450    0.000    1.791    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:46(_stringify_list)
   972727    0.595    0.000    1.578    0.000 {method 'decode' of 'str' objects}
    44110    0.078    0.000    1.487    0.000 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:590(row_smash_case)
  1764440    0.502    0.000    1.328    0.000 /usr/lib/python2.7/dist-packages/osgeo/ogr.py:3183(GetFieldDefn)
  2029152    0.652    0.000    1.325    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:35(_stringify)
    88222    0.933    0.000    1.114    0.000 /usr/lib/python2.7/csv.py:143(_dict_to_list)
  7149945    1.114    0.000    1.114    0.000 {isinstance}
  1764400    0.537    0.000    1.109    0.000 /usr/lib/python2.7/dist-packages/osgeo/ogr.py:3478(GetNameRef)
  1764400    0.440    0.000    0.988    0.000 /usr/lib/python2.7/dist-packages/osgeo/ogr.py:2552(IsFieldSet)
   972728    0.309    0.000    0.983    0.000 /usr/lib/python2.7/encodings/utf_8.py:15(decode)
  1764400    0.504    0.000    0.960    0.000 /usr/lib/python2.7/dist-packages/osgeo/ogr.py:2277(GetFieldCount)
    44110    0.715    0.000    0.920    0.000 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:592(<dictcomp>)
    88222    0.848    0.000    0.848    0.000 {method 'writerow' of '_csv.writer' objects}
  1764440    0.826    0.000    0.826    0.000 {_ogr.FeatureDefn_GetFieldDefn}
   972727    0.258    0.000    0.782    0.000 /usr/lib/python2.7/dist-packages/osgeo/ogr.py:2335(GetFieldAsString)
   972728    0.674    0.000    0.674    0.000 {_codecs.utf_8_decode}
    44111    0.026    0.000    0.636    0.000 /usr/lib/python2.7/dist-packages/osgeo/ogr.py:1190(GetNextFeature)

Profile conclusions

For a CSV source, we spend roughly half the time converting source CSV to extracted and half converting extracted to final output. That’s no surprise; the two CSV files are nearly identical. A whole lot of that time is spent deep in the bowels of Python’s CSV module reading rows. Again no surprise, but it confirms my suspicion that DictReader may be doing more work than we’d like.

For a shapefile source, we spend roughly 2/3 of the time using OGR to convert to CSV and 1/3 of the time converting the intermediate CSV to the final output. The OGR code is opaque; it's not clear how to figure out what it's really spending time on inside the C module.

I'm not sure what conclusions to draw here; there never are with profiling tools. I think my next step should be benchmarking Python's CSV DictReader and seeing whether some simpler parsing would be significantly faster. I also think it'd be a huge improvement to remove that intermediate CSV file entirely; there's a lot of overhead reading and writing it. The intermediate file makes the code way simpler to work with, but it should be possible to stream the data in memory and retain most of the same code model.
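
To sketch what removing the intermediate file could look like (the function names here are hypothetical stand-ins for the real conform code): the extract step becomes a generator that yields rows, and the transform step consumes them directly.

import csv

def extract_rows(source_path):
    # Yield one row at a time instead of writing extracted.csv to disk
    for feature in read_source(source_path):     # hypothetical source reader
        yield feature_to_row(feature)            # hypothetical row converter

def conform(source_path, out_path):
    with open(out_path, 'w') as f:
        writer = csv.writer(f)
        for row in extract_rows(source_path):
            writer.writerow(row_transform(row))  # hypothetical transform step

The file-based code model mostly survives; only the plumbing between the two halves changes.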

Not clear any of this optimization is worth the effort.

Compare two OpenAddress runs

I'm re-running OpenAddresses many times, trying to be sure the output stays roughly consistent. It's never the same twice; servers are unreliable and the data is changing. Also, some code changes make small cosmetic differences, like rounding error.

What works best for me is comparing the line counts in the output files:

wc -l out/*/out.csv > wc.txt

diff --suppress-common-lines -y oa-full-790/wc.txt oa-new-esri/wc.txt

This only highlights sources that output a different number of lines; if the contents of the columns are garbled, you won't see that. But it's a good way to get an overview of what changed between two runs.

OpenAddress Machine mark 3

this post is a work in progress


When I'm back from India in late February I'd like to re-design OpenAddress Machine to a more decoupled architecture. Right now the system runs one big Python process which takes all sources from download all the way to output CSV. (There is some state carried over from run to run, but not much.) It works pretty well and is simple, and a full run takes just 2–3 hours. But the multiprocessing stuff is brittle, and the whole system is both overkill (most sources never change, so why reprocess them?) and non-responsive (there's no fast way to test a new source).


The plan: move to a decoupled architecture where tasks run as fully independent Unix processes. They share state via some centralized database / cloud storage thingy, and post results to S3 for web serving.

As with the current architecture, the unit of work in the job system is a single source file such as us-ca-san_francisco. It runs as an independent Unix process with no awareness of other sources. It downloads, conforms, summarizes, reports, and uploads its results to the database and to S3.

In addition we need a task queueing system, something that decides when to run source jobs and posts them on the task queue. That process works mostly by reading the database and deciding what work needs doing. Some example rules: "immediately process a new source"; "check a source weekly for new data and reprocess"; "clear the whole queue and re-process everything".

Finally we need a task executor that looks at the queue and launches tasks as needed. This could be part of the task queuer, but it helps to think of it as a separate entity. When a task is finished it puts the result on the task record log.

Database implementation

I'm totally agnostic about the data store we use to track runs. It's a very low-demand system: just a few hundred megs of data and a handful of transactions a minute. It does need consistency and responsiveness. The simple choice is a PostGIS database running persistently on some 24/7 server. I'm not sure what the cloud-services equivalent would be, so running with the PostGIS idea, here are some schema sketches; a concrete DDL sketch follows the two lists.

Completed task records

  • Task ID
  • Time started
  • Time finished
  • Source name
  • Success / failure
  • S3 output paths: out.csv result, debug logs, etc
  • Whatever else write_state does.

Task queue

  • There are a zillion ways to do a task queue in a database. Rows are placed in the table by the task queuer, then removed by the task launcher when they are dispatched.
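
Rendered as Postgres DDL driven from Python, a sketch might look like this. The table and column names (and the connection string) are guesses, not a working design.

import psycopg2

SCHEMA = '''
CREATE TABLE completed_tasks (
    task_id      serial PRIMARY KEY,
    source_name  text NOT NULL,       -- e.g. us-ca-san_francisco
    started      timestamptz,
    finished     timestamptz,
    success      boolean,
    s3_out_csv   text,                -- S3 path of the out.csv result
    s3_debug_log text,                -- S3 path of the debug logs
    state        json                 -- whatever else write_state records
);

CREATE TABLE task_queue (
    task_id      serial PRIMARY KEY,
    source_name  text NOT NULL,
    queued       timestamptz DEFAULT now()
);
'''

conn = psycopg2.connect('dbname=openaddresses')  # hypothetical database name
conn.cursor().execute(SCHEMA)
conn.commit()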

Task implementation

A task should be a single Python program. It takes a source.json specification file and any job metadata that’s needed (hopefully none?) and executes it. For debugging purposes this program should be able to be run standalone on a developer’s machine with no dependencies on the queuing system, the task manager database, or S3.
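
A minimal sketch of that command-line shape (the flag names and run_source function are hypothetical):

#!/usr/bin/env python
# Hypothetical standalone task runner: process one source start to finish.
import argparse

def main():
    parser = argparse.ArgumentParser(description='Run one OpenAddresses source')
    parser.add_argument('source_json',
                        help='source spec, e.g. us-ca-san_francisco.json')
    parser.add_argument('--no-upload', action='store_true',
                        help='skip the database and S3, for local debugging')
    args = parser.parse_args()
    # download, conform, summarize, report; then upload unless --no-upload
    run_source(args.source_json, upload=not args.no_upload)  # hypothetical

if __name__ == '__main__':
    main()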

These jobs will be a natural fit for cheap EC2 spot instances. Right now tasks take 0.5–90 minutes to run. It's a bit wasteful to spin up a whole EC2 instance for 30 seconds of work, but maybe that's OK. We could also run tasks on a single workhorse machine; that's effectively what we're doing now. A single mid-level Linux box can run 16 tasks in parallel. (In fact parallelism is good; the jobs are a mix of load types.)

Alternate: subtasks

This proposal assumes a whole source runs start to finish as a single Python process. But processing a source actually consists of several subtasks, and it might be good to run them as separate jobs. Here's what the subtasks are:

  • Download from source URL / ESRI store
  • Sample the source data for user inspection
  • Convert the source JSON/CSV/SHP data to an extracted.csv file
  • Conform the extracted.csv data to OpenAddresses out.csv format
  • Upload out.csv and job data to S3

There are times I've wanted to be able to execute these subtasks separately. Particularly the download step, which is slow and unreliable. To some extent the current data caching strategy in Mark 1 and 2 deals with that, and it may be sufficient. But you could imagine breaking every source into 5 tasks and running them separately on a job queue. Worth considering.


Why “mark 3”?

OpenAddress Machine mark 1 was Mike’s original work, wrapping the Node code in a bunch of Python scripts to do regular runs and post results in a web dashboard. It is awesome.

We’ve been working on mark 2, the “ditch-node” branch, where we rewrote all the Node code in Python. The overall job architecture is about the same as mark 1, the management of downloads and files and stuff. It’s one big Python process with jobs run in parallel using multiprocessing. We did make some changes to how tasks are conceived and run.


Some notes after this post was written

  • Ian notes that GitHub hooks and Jenkins could trigger a build when a new source is added

multiprocessing worker processes dying

Python's multiprocessing.Pool has a design wrinkle that's a bit awkward. If you have a pool working through N tasks and one of the worker subprocesses dies unexpectedly (say, from a SIGTERM), the pool hangs. It looks like N-1 tasks have finished and there's still one waiting, but that task will never complete and your parent process is effectively stuck. Note that normal termination doesn't do this: random exceptions, SIGINT from Ctrl-C, etc. are all handled. A plain "kill" triggers it though, as does "kill -9", and probably a segfault in the Python interpreter as well.
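
Here's a minimal reproduction, assuming the 2015-era multiprocessing behavior described above: one worker SIGKILLs itself, and map() never returns.

import multiprocessing, os, signal, time

def work(i):
    if i == 3:
        os.kill(os.getpid(), signal.SIGKILL)  # simulate an unexpected worker death
    time.sleep(0.1)
    return i

if __name__ == '__main__':
    pool = multiprocessing.Pool(4)
    print(pool.map(work, range(8)))  # hangs: the dead worker's task never finishes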

Personally I think this is a bad design choice, but it’s not by accident. There was a huge discussion about this behavior three years ago. I haven’t read it all, but most of the comments seem to be about the wisdom and difficulty of recovering from a bad state. The ticket got closed after someone committed some changes to concurrent.futures (Python 3’s preferred new library). Nothing changed in multiprocessing.Pool.

Recently this issue was revisited for multiprocessing.Pool with a new bug filed that includes a patch. The approach there (and in concurrent.futures) is that if a child dies unexpectedly, the whole Pool is killed immediately with a BrokenProcessPool exception. I'm not wild about this choice, but it's definitely better than hanging.

None of this applies to the Python distribution we're running today: the pool will hang. For OpenAddresses I suggest we work around the issue by simply not killing workers. If you want to abort a worker early, try SIGALRM instead (see the sketch below). We could also install a SIGTERM handler to catch the simple "kill" case from an operator, but I'm not sure that's wise.
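
The SIGALRM approach, sketched (do_one_source is a hypothetical stand-in for the real per-source work): the alarm raises an ordinary exception inside the worker, which the Pool handles fine.

import signal

class JobTimeout(Exception):
    pass

def on_alarm(signum, frame):
    raise JobTimeout()

def worker(task):
    # Abort from inside the worker process rather than killing it externally
    signal.signal(signal.SIGALRM, on_alarm)
    signal.alarm(90 * 60)  # give up after 90 minutes
    try:
        return do_one_source(task)  # hypothetical per-source work function
    finally:
        signal.alarm(0)  # cancel any pending alarm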

In addition, OpenAddresses has a SIGUSR1 handler that lets someone externally shut down the whole pool. It's good for recovering from this state.

Update: we hit this bug again in a new way. Some of the worker processes were getting killed by the Linux OOM killer. The Python code doesn't see any exception; the worker just gets a SIGTERM or something. The only way you know is a record in syslog. (And the multiprocessing debug logs show a new worker started.)