Installing openaddress-machine on a new EC2 system using Chef

No need to install stuff manually; Mike already wrapped up scripts to set up an EC2 system with Chef for us. Here’s how to use it on a brand-new EC2 micro instance:

sudo bash
apt-get update
apt-get upgrade
apt-get install git
git clone https://github.com/openaddresses/machine.git
cd machine/chef
./run.sh

Done! The shell command openaddr-process-one now works and does stuff.

In brief, this:

  1. installs Chef and Ruby via apt
  2. runs a Python setup recipe. That installs a few Ubuntu Python packages with apt (including GDAL and Cairo), then does a “pip install” in the OpenAddress machine directory. This tells pip to install a bunch of other Python stuff we use.
  3. runs a recipe for OpenAddresses. This uses git to put the source JSON data files in /var/opt.

Update

But really, that’s so manual. If you just pip install Openaddresses-Machine, it installs a /usr/local/bin/openaddr-ec2-run script that will do the work for you. That in turn invokes a run.py script which you run on your local machine. It, among other things, runs a templated shell script to set up an EC2 instance and run the job on it.

The shell script that is run on EC2 is pretty basic. It:

  1. Updates apt (but does not upgrade)
  2. Installs git and apache2
  3. clones the openaddress-machine repo into /tmp/machine
  4. Runs scripts to set up swap on the machine, then invokes Chef to configure it
  5. Runs openaddr-process to do the job
  6. Shuts the machine down.

The run.py script you use on your own machine is mostly about getting an EC2 instance.

  1. De-template the shell script and put it in user_data.
  2. Use boto.ec2 to bid on a spot instance
  3. Wait for up to 12 hours until we get our instance

The details of how the EC2 instance is bid for, created, and waited on are a bit funky but seem well contained.
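The de-templating step is plain string substitution; here’s a minimal sketch. The template text and variable names below are mine, not the real machine’s:

```python
from string import Template

# Hypothetical user_data template; the real one lives in the machine repo
# and takes more parameters than shown here.
template = Template('''#!/bin/sh
apt-get update
apt-get install -y git apache2
git clone ${repo} /tmp/machine
/tmp/machine/chef/run.sh
openaddr-process
shutdown -h now
''')

user_data = template.substitute(repo='https://github.com/openaddresses/machine.git')

# run.py then hands user_data to boto.ec2 when bidding on the spot instance,
# roughly conn.request_spot_instances(price=..., image_id=..., user_data=user_data)
```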

Installing openaddresses-machine on a new EC2 instance

Just documenting some work I did. The pattern here follows my notes on making openaddresses work in virtualenv, although I didn’t use virtualenv here. This is really how to get the minimal stuff installed on a new Ubuntu box to run our code. Just a bunch of shell commands.

Set up an Ubuntu 14.04 AMI. The smallest instance size will run it OK.

Do the following as root:

# update Ubuntu
apt-get update
apt-get upgrade

# install pip
cd /tmp
wget https://bootstrap.pypa.io/get-pip.py
python get-pip.py

# install gcc and python dev
apt-get install gcc g++ python-dev

# install openaddresses
pip install Openaddresses-Machine

# install Cairo
apt-get install libffi-dev libcairo2
pip install cairocffi

# install GDAL
apt-add-repository ppa:ubuntugis/ubuntugis-unstable
apt-get update
apt-get install python-gdal libgdal-dev
pip install GDAL

I haven’t run these commands a second time yet, but it should be close. The last line for “pip install GDAL” is probably not necessary, and it’d probably be better to install Openaddresses-Machine last although it may not matter.

Next step: automate this with Chef Solo.

After that: set up Honcho and a Procfile to run my queue service worker.

OpenAddress machine continuous queue

Migurski and I have been working up a new architecture for OpenAddresses, specifically the python code that runs source jobs. Currently everything runs as a big batch job every two days. The primary goal is to get a system that can run a new submitted source instantly, ideally with useful feedback to the submitter.

Mike’s insight was to build a system a little like Travis; something that looks at GitHub commits with a hook, runs a new job every time something is committed, and posts back info via the GitHub status API. GitHub is nice because it provides a file store and change tracking. Ultimately we don’t want to require a GitHub login to submit sources, and we probably want a custom UI, but that can come later.

Here are the components we’re talking about building:

  1. A GitHub webhook that submits info to our new job server when stuff is checked in.
  2. A new job server that accepts HTTP posts of GitHub events. Mike’s got one working in Flask running on Heroku.
  3. A job queue. New jobs get enqueued by the new job server, then some worker daemon process checks for jobs to run and executes them. Can be a very simple queue, but needs to support job priorities (interactive vs. batch). Right now I’m thinking of writing my own using Postgres running in Amazon RDS, but there’s got to be something I can just use that’s not super overkill. (Update: pq looks good.)
  4. Workers to run jobs. Something to look for new jobs and execute them. Current thought is a persistent Unix server running on EC2. Installs openaddr and its dependencies via pip, then runs jobs via openaddr-process-one, then posts results back to the job queue result queue.
  5. Job completion notifications back to GitHub using the GitHub status API. Mike has been tinkering with code to do this already.
  6. A batch-mode refresher. Something to run sources periodically even if no one’s edited the source specification in awhile. Mostly to catch new data on the remote server.
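As a sketch of the queue semantics in item 3, here’s the priority-ordered dequeue in miniature. I’m using sqlite3 purely to keep the example self-contained and runnable; the real thing would be Postgres on RDS, or something like pq:

```python
import sqlite3, json

# Minimal job queue sketch; schema and function names are illustrative only.
db = sqlite3.connect(':memory:')
db.execute('''create table jobs (
    id integer primary key,
    priority integer,          -- 0 = interactive, 1 = batch
    payload text,
    taken integer default 0)''')

def enqueue(payload, priority=1):
    db.execute('insert into jobs (priority, payload) values (?, ?)',
               (priority, json.dumps(payload)))

def dequeue():
    # Lowest priority number wins; FIFO within a priority level
    row = db.execute('''select id, payload from jobs
                        where taken = 0 order by priority, id limit 1''').fetchone()
    if row is None:
        return None
    db.execute('update jobs set taken = 1 where id = ?', (row[0],))
    return json.loads(row[1])

enqueue({'source': 'us-ca-berkeley.json'}, priority=1)   # batch refresh
enqueue({'source': 'new-submission.json'}, priority=0)   # interactive jumps the line
job = dequeue()
```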

Here are some open things to design.

  1. Job status reporting with humane UI. Currently we get very little info back when a job is run: a few bits of status (did it complete, how long it took), a debug log from the process, and any actual output files. We need more. In particular I think the stuff you currently learn from the debug log needs to be expanded into useful user-level event data, i.e. events like “source JSON parsed”, “source data downloaded”, etc.
  2. UI for creating source specification JSON files. Imagining some guided, wizard-like UI using Javascript and forms. There’s already a pretty good one on the live site; it just needs some expansion. Also ultimately we shouldn’t require a GitHub login.
  3. Security. A bit nervous running completely unreviewed source specs from anonymous people. Isolating the worker in a throwaway machine might help. Or doing some sanitization on inputs.
  4. Statistics reporting. I have a dashboard that works OK for the existing nightly batch mode, but we’ll want something different for this continuous mode thing.

See also OpenAddresses Mark 3, a proposal I sketched in January. What we’re doing here is similar but without the component of breaking any individual source run up into separate tasks.

Python3 est arrivé

I’m now using Python3 by default for all new projects. Almost entirely because I prefer sane Unicode handling. But really because Py3 has turned the corner and is now useful for most things. Probably it turned that corner a year or more ago, but it was the OpenAddresses work that made me start using Python 3 regularly. It helps that Homebrew and Debian both make it very easy to install ‘python3’ and ‘pip3’. And of course that most packages I care about are now ported over.

It took Python five years longer than I expected for the Py3 transition to get to this point. Let’s hope we don’t have to do it again.

Address schema

Some stuff of possible interest to OpenAddresses: the US government’s Federal Geographic Data Committee has a standardized street address schema. The docs are 140 pages of text describing an XML schema, which sure seems wordy, but then it’s also very precise. They even put the XML schema online; here’s a text rendering.

It’s super wordy and contains a lot of detail, but at the core is a schema we could repurpose that says things like “a street address consists of a number, a prefix, a street name, …”

There’s a good summary of the schema on page 3 of the executive summary. Here’s the meat of it:

[Screenshot: summary of the FGDC address schema, from page 3 of the executive summary]
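The flavor of that decomposition, in miniature (the field names below are illustrative paraphrases, not the FGDC’s exact XML element names):

```python
# A street address broken into the kind of parts the FGDC schema defines.
# Field names and values are made up for illustration.
address = {
    'AddressNumber': '1600',
    'StreetNamePreDirectional': 'N',
    'StreetName': 'Main',
    'StreetNamePostType': 'St',
    'PlaceName': 'Springfield',
    'StateName': 'IL',
    'ZipCode': '62701',
}

# Reassemble just the street-address portion from its parts
street_address = ' '.join([address['AddressNumber'],
                           address['StreetNamePreDirectional'],
                           address['StreetName'],
                           address['StreetNamePostType']])
```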

installing openaddresses-machine in a virtualenv

Trying to get a server up running OpenAddresses in a controlled environment. I’m normally very sloppy and install everything as root in /usr, but I figured I should use virtualenv this time. Machine actually has a very nice Chef setup and bootstraps itself regularly in EC2 and Travis contexts, but I’m stubborn and doing it my own way.

Install openaddresses-machine and its simple dependencies

  • Prepare a virtualenv directory and activate it
  • pip install Openaddresses-Machine

Install cairocffi. This required a C library I didn’t have

  • sudo apt-get install libffi-dev
  • pip install cairocffi

Install Python GDAL. This is a mess; I’m not sure why a simple “pip install GDAL” doesn’t work from inside the virtualenv. And I’m not sure the instructions below are correct; it’s probably installing python-gdal globally on the system via apt and then again in the virtualenv via pip. But that gets all the C dependencies we need somewhere on the system. There’s extra rigamarole to get the bleeding-edge GDAL instead of the stock Ubuntu GDAL. Also, building GDAL requires a C++ compiler.

  • apt-get install software-properties-common
  • sudo apt-add-repository ppa:ubuntugis/ubuntugis-unstable
  • sudo apt-get update
  • sudo apt-get install python-gdal libgdal-dev
  • sudo apt-get install g++
  • pip install GDAL
  • PSYCH! That won’t work. Follow the instructions in this gist for how to manually configure and install gdal. Apparently its packaging is not compatible with pip?

Some OpenAddresses machine stats

I’ve collected the state.txt files written by OpenAddress machine and stuffed them in a database for ad hoc reporting. Here are some queries and results.

Slowest processing

select source, max(address_count), round(avg(process_time)) as time
from stats where address_count > 0
group by source order by time desc limit 10;

source                          max(addres  time
------------------------------  ----------  ----------
nl.json                         14846597    8769.0
au-victoria.json                3426983     8438.0
es-25830.json                   8778892     6450.0
us-va.json                      3504160     4377.0
dk.json                         3450853     4220.0
pl-mazowieckie.json             995946      4108.0
us-ny.json                      4229542     4047.0
pl-lodzkie.json                 685093      3575.0
pl-slaskie.json                 575596      3331.0
pl-malopolski.json              670892      3216.0

Slowest cache

source                          max(addres  time
------------------------------  ----------  ----------
us-ny.json                      4229542     5808.0
za-nl-ethekwini.json            478939      3617.0
us-va-city_of_chesapeake.json   99197       3457.0
us-ca-marin_county.json         104067      3034.0
us-mi-ottawa.json               109202      2232.0
us-fl-lee.json                  548264      1999.0
us-ne-lancaster.json            97859       1995.0
us-ia-linn.json                 95928       1519.0
dk.json                         3450853     1348.0
us-az-mesa.json                 331816      890.0

Summary of runs

For each run, print out how many sources we tried, how many cached, how many sampled, and how many processed. The counting here is a bit bogus but I think close.

select ts, count(distinct(source)), count(distinct(cache_url)), count(distinct(sample_url)), count(distinct(processed_url))
from stats group by ts;
ts               addresses        source  caches  sample  proces
---------------  ---------------  ------  ------  ------  ------
1420182459                        737     689     557     490
1420528178                        738     691     559     490
1420787323                        740     692     560     495
1421133148                        789     717     558     521
1421392045                        790     718     559     523
1421737732                        790     719     557     521
1421996867                        790     719     558     522
1422342479                        790     718     560     524
1422601615                        790     652     582     503
1422645440                        790     655     585     511
1422748762                        790     650     580     504
1422773454                        790     647     577     506
1422860989                        790     653     582     508
1423034051                        790     658     588     509
1423206384                        790     658     587     508
1423465798                        790     660     588     510
1423638409                        790     659     588     508
1423811369                        790     661     590     508
1424070358                        790     657     586     509
1424243321                        790     645     566     499
1424416039                        790     650     534     469
1424675318        88031206        790     648     563     496
1424849197       111999281        794     658     570     505
1425022188       105056481        794     653     570     508
1425280104       109864743        794     651     567     507
1425453036       111734276        786     655     573     520
1425625770       116323932        786     656     571     519
1425881358       113713353        786     644     563     528
1426054556       117771563        788     666     586     534
1426228915       117335107        788     666     584     538
1427090950       113916633        788     665     581     527

Time to complete successful runs

 select ts, round(avg(cache_time)) as cache, round(avg(process_time)) as process 
 from stats where length(processed_url) > 1 
 group by ts order by ts;
ts               cache            proces
---------------  ---------------  ------
1420182459       63.0             154.0
1420528178       50.0             147.0
1420787323       46.0             166.0
1421133148       43.0             159.0
1421392045       43.0             160.0
1421737732       43.0             163.0
1421996867       45.0             167.0
1422342479       46.0             164.0
1422601615       48.0             228.0
1422645440       56.0             314.0
1422748762       47.0             300.0
1422773454       50.0             289.0
1422860989       61.0             296.0
1423034051       66.0             263.0
1423206384       44.0             260.0
1423465798       51.0             294.0
1423638409       51.0             270.0
1423811369       50.0             260.0
1424070358       55.0             277.0
1424243321       60.0             265.0
1424416039       57.0             298.0
1424675318       51.0             265.0
1424849197       62.0             323.0
1425022188       53.0             284.0
1425280104       59.0             312.0
1425453036       55.0             325.0
1425625770       79.0             367.0
1425881358       77.0             315.0
1426054556       84.0             335.0
1426228915       76.0             361.0
1427090950       71.0             272.0

Most improved

A count of which sources have the most variance in how many address lines we got out. Kinda bogus, but fun.

select source, max(address_count)-min(address_count) as diff, min(address_count) as minc, max(address_count) as maxc
from stats where address_count > 0 
group by source order by diff desc limit 15;

source                          diff        minc        maxc
------------------------------  ----------  ----------  ----------
us-mn-dakota.json               138301      27218       165519
us-pa-allegheny.json            53399       474492      527891
us-ne-lancaster.json            44996       52863       97859
us-va-salem.json                9992        614         10606
dk.json                         5258        3445595     3450853
be-flanders.json                2809        3365557     3368366
us-nc-mecklenburg.json          2117        499524      501641
us-co-larimer.json              1118        169500      170618
ca-on-hamilton.json             1112        243554      244666
us-va-james_city.json           1071        33383       34454
us-tx-austin.json               837         374854      375691
us-or-portland.json             786         415220      416006
us-nc-buncombe.json             784         147492      148276
us-nc-union.json                720         81241       81961
us-oh-williams.json             603         18496       19099

Census of US county parcel records

Fairview Industries has a great map of which US counties publish GIS parcel data. Mike got the shapefile for the database. I went ahead and extracted just the URLs and county names to a convenient CSV file.

The file lists a total of 3214 counties. 2022 have ViewURLs defined and 590 have DownURLs defined.

Here’s my extractor code.

#!/usr/bin/env python3

import fiona, sys, csv

out = csv.writer(open('extracted.csv', 'w', encoding='utf-8'))

out.writerow(('FIPS', 'State', 'County', 'ViewURL', 'DownURL'))

count = 0
viewUrls = 0
downUrls = 0
with fiona.drivers():
    with fiona.open('counties.shp') as counties:
        for county in counties:
            props = county['properties']
            # Fix mojibake; Fiona read these strings as ISO-Latin-1 but they are actually UTF-8
            cntyNm = props['CntyNm'].encode('latin_1')
            cntyString = cntyNm.decode('utf-8')
            out.writerow((props['StCntyFIPS'], props['StNm'], cntyString, props['ViewURL'], props['DownURL']))
            count += 1
            if props['ViewURL']:
                viewUrls += 1
            if props['DownURL']:
                downUrls += 1

sys.stderr.write('%d rows\n%d ViewURLs\n%d DownURLs\n' % (count, viewUrls, downUrls))
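The mojibake repair in that loop is the standard double-decode trick, in miniature:

```python
# UTF-8 bytes wrongly decoded as Latin-1 produce mojibake; re-encoding as
# Latin-1 recovers the original bytes, which then decode correctly as UTF-8.
mojibake = 'Año'.encode('utf-8').decode('latin_1')   # 'AÃ±o'
fixed = mojibake.encode('latin_1').decode('utf-8')   # 'Año'
```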

Python CSV benchmarks

I tested various ways of reading a CSV file in python, from simply reading the file line by line to using the full unicodecsv DictReader. Here’s what I discovered. Test data is dk.csv, a 3.4M row CSV file with 46 columns. (See also: file reading benchmarks.)

  • The Python2 csv module takes 2x longer than a naive split(',') on every line
  • Python2 DictReader takes 2-3x longer than the simple csv reader that returns tuples
  • Python2 unicodecsv takes 5.5x longer than csv
  • Python3 csv takes 2-3x longer than Python2 csv. However it is Unicode-correct
  • Pandas in Python2 is about the same speed as DictReader, but is Unicode-correct.

I’m not sure why unicodecsv is so slow. I did a quick look in cProfile and all the time is being spent in next() where you’d expect. All those isinstance tests add significant time (20% or so) but that’s not the majority of the 5.5x slowdown. I guess string decoding is just a lot of overhead in Python2? It’s not trivial in Python3 either; I hadn’t realized how much slower string IO was in Py3. I wonder if there’s more going on. Anyway I filed an issue on unicodecsv asking about performance.

I’ve never used Pandas before. I ran into someone else saying unicodecsv is slow who switched to Pandas. It sure is fast! I think it’s a lot of optimized C code. But Pandas is a big package and has its own model of data and I don’t know that I want to buy into all of that. Its CSV module is nicely feature-rich though.

Not sure what conclusion to draw for OpenAddresses. I think we spend ~50% of our time just parsing CSV (for a CSV source like dk or nl). Switching from DictReader to regular reader() is the least pain. Concretely, for a 60 minute job that’d bring the time down to about 40–45 minutes. A nice improvement, but not life altering. Switching to Python3 so we no longer need unicodecsv would also save about the same amount of time.
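The DictReader-to-reader switch in miniature (Python 3 shown; the Python 2 version is the same shape):

```python
import csv, io

# Tiny stand-in for a source CSV
data = "NUMBER,STREET\n100,Main St\n22,Oak Ave\n"

# DictReader builds a dict for every row: convenient, but extra work per row
rows = list(csv.DictReader(io.StringIO(data)))
number = rows[0]['NUMBER']

# Plain reader returns lists; resolve column positions once from the header
reader = csv.reader(io.StringIO(data))
header = next(reader)
col = header.index('NUMBER')
numbers = [row[col] for row in reader]
```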

Python 2 results

 0.17s catDevNull
 0.25s wc
 1.91s pythonRead
 0.58s pythonReadLine
 6.62s dumbCsv
 10.97s csvReader
 29.63s csvDictReader
 62.82s unicodeCsvReader
120.93s unicodeCsvDictReader
 27.93s pandasCsv

Python 3 results

 0.17s catDevNull
 0.25s wc
 5.18s pythonRead
 3.50s pythonReadLine
11.83s dumbCsv
27.77s csvReader
51.37s csvDictReader

The Code

# Benchmark various ways of reading a csv file
# Code works in Python2 and Python3

import sys, time, os, csv, timeit
try:
    import unicodecsv
except ImportError:
    unicodecsv = None
try:
    import pandas
except ImportError:
    pandas = None

fn = sys.argv[1]

def warmCache():
    os.system('cat %s > /dev/null' % fn)

def catDevNull():
    os.system('cat %s > /dev/null' % fn)

def wc():
    os.system('wc -l %s > /dev/null' % fn)

def pythonRead():
    fp = open(fn)
    fp.read()

def pythonReadLine():
    fp = open(fn)
    for l in fp:
        pass

def csvReader():
    reader = csv.reader(open(fn, 'r'))
    for l in reader:
        pass

def unicodeCsvReader():
    reader = unicodecsv.reader(open(fn, 'r'))
    for l in reader:
        pass

def csvDictReader():
    reader = csv.DictReader(open(fn, 'r'))
    for l in reader:
        pass

def unicodeCsvDictReader():
    reader = unicodecsv.DictReader(open(fn, 'r'))
    for l in reader:
        pass

def dumbCsv():
    'Really simplistic CSV style parsing'
    fp = open(fn, 'r')
    for l in fp:
        d = l.split(',')

def pandasCsv():
    d = pandas.read_csv(fn, encoding='utf-8')
    # Ensure pandas really read the whole thing
    d.tail(1)

def run(f):
    "Run a function, return time to execute it."
    # timeit is complex overkill
    s = time.time()
    f()
    e = time.time()
    return e-s

warmCache()

functions = [catDevNull, wc, pythonRead, pythonReadLine, dumbCsv, csvReader, csvDictReader]
if unicodecsv:
    functions.append(unicodeCsvReader)
    functions.append(unicodeCsvDictReader)
if pandas:
    functions.append(pandasCsv)

for f in functions:
    t = run(f)
    print('%.2fs %s' % (t, f.__name__))

OpenAddresses optimization: some baseline timings

Some rough notes for optimizing openaddresses conform. These times are from a January 23 run of the Python code. The machine was busy running 8x jobs, so times may be a bit inflated over the true values, but it’s a start.

Here’s 3 sources that I know to be slow because of our own code. The time reported here are purely the time doing conform after the thing was downloaded.

  • nl (csv source): 35 minutes, 14.8M rows
  • dk (csv source): 43 minutes, 3.4M rows
  • au-victoria (shapefile source): 46 minutes, 3.4M rows
  • ??? ESRI source. No good examples; most of my conform code treats this as effectively CSV anyway, so I’m going to ignore it for now.

I just ran nl again and it took 31.5 minutes (26 minutes user, 5 minutes sys). Close enough, I’ll take these January 23 times as still indicative. At least for CSV sources.

Here’s some fast / short sources I can use for testing. These are total times including network.

  • us-ca-palo_alto.json (csv source) 26 seconds
  • ca-bc-north_cowichan.json (csv source) 24 seconds
  • us-wa-chelan.json (shapefile source) 33 seconds

And here’s a report of top slow jobs that didn’t actually time out. Some of this slowness is due to network download time.

3521s us-va-city_of_chesapeake.json
2807s au-victoria.json
2765s us-ca-marin_county.json
2589s dk.json
2116s nl.json
2032s us-sc-aiken.json
1660s us-va-new_kent.json
1639s es-25830.json
1541s us-nc-alexander.json
1498s us-va.json
1367s us-va-fairfax.json
1352s us-sd.json
1345s us-ca-los_angeles_county.json
1325s us-mn-ramsey.json
1216s us-al-calhoun.json
1015s us-mi-kent.json
973s us-ms-hinds.json
955s us-wa-skagit.json
937s us-tn-rutherford.json
918s us-ca-solano_county.json
918s us-nc.json
786s us-fl-palm_beach_county.json
783s us-wa-seattle.json
776s us-wa-king.json
769s be-flanders.json
762s us-sc-laurens.json
729s us-wy-natrona.json
691s us-il-mchenry.json
682s us-tx-houston.json
678s us-al-montgomery.json
656s pl.json

grep Finished *.log | sort -nr -k 9 | cut -c 65- | sed 's%for /var/opt/openaddresses/sources/%%' | head -50

Here’s some quicky cProfile output, sorted by cumulative time.

ca-bc-north_cowichan

python -m openaddr.conform ~/src/oa/profile/sources-fast/ca-bc-north_cowichan.json ~/src/oa/profile/caches/cowichan.csv /tmp/o/foo
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    6.053    6.053 <string>:1(<module>)
        1    0.000    0.000    6.053    6.053 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:723(main)
        1    0.000    0.000    6.049    6.049 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:700(conform_cli)
        1    0.000    0.000    3.628    3.628 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:652(extract_to_source_csv)
        1    0.042    0.042    3.628    3.628 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:433(csv_source_to_csv)
    30260    0.170    0.000    3.241    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:187(next)
    30260    0.163    0.000    2.958    0.000 /usr/lib/python2.7/csv.py:104(next)
    30262    2.639    0.000    2.715    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:105(next)
        1    0.042    0.042    2.421    2.421 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:678(transform_to_out_csv)
    15129    0.066    0.000    1.328    0.000 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:510(row_extract_and_reproject)
559802/15130    0.636    0.000    1.256    0.000 /usr/lib/python2.7/copy.py:145(deepcopy)
15132/15130    0.147    0.000    1.219    0.000 /usr/lib/python2.7/copy.py:253(_deepcopy_dict)
    30260    0.028    0.000    0.822    0.000 /usr/lib/python2.7/csv.py:151(writerow)
    30260    0.033    0.000    0.599    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:82(writerow)
    15129    0.026    0.000    0.571    0.000 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:558(row_transform_and_convert)
    30262    0.085    0.000    0.402    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:46(_stringify_list)
   332882    0.147    0.000    0.312    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:35(_stringify)
   509313    0.236    0.000    0.309    0.000 /usr/lib/python2.7/copy.py:267(_keep_alive)
    15129    0.025    0.000    0.256    0.000 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:590(row_smash_case)
    30260    0.170    0.000    0.196    0.000 /usr/lib/python2.7/csv.py:143(_dict_to_list)

us-wa-chelan

python -m openaddr.conform ~/src/oa/profile/sources-fast/us-wa-chelan.json ~/src/oa/profile/caches/chelan/*shp /tmp/o/foo

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   29.549   29.549 <string>:1(<module>)
        1    0.000    0.000   29.549   29.549 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:723(main)
        1    0.000    0.000   29.545   29.545 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:700(conform_cli)
        1    0.000    0.000   19.640   19.640 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:652(extract_to_source_csv)
        1    3.952    3.952   19.640   19.640 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:343(ogr_source_to_csv)
        1    0.163    0.163    9.903    9.903 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:678(transform_to_out_csv)
    44111    0.367    0.000    6.508    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:187(next)
  1764400    2.433    0.000    6.275    0.000 /usr/lib/python2.7/dist-packages/osgeo/ogr.py:3012(GetField)
    44111    0.324    0.000    5.837    0.000 /usr/lib/python2.7/csv.py:104(next)
    44112    5.138    0.000    5.370    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:105(next)
    88222    0.095    0.000    3.969    0.000 /usr/lib/python2.7/csv.py:151(writerow)
    88222    0.120    0.000    2.759    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:82(writerow)
    44110    0.078    0.000    2.512    0.000 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:558(row_transform_and_convert)
    88224    0.450    0.000    1.791    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:46(_stringify_list)
   972727    0.595    0.000    1.578    0.000 {method 'decode' of 'str' objects}
    44110    0.078    0.000    1.487    0.000 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:590(row_smash_case)
  1764440    0.502    0.000    1.328    0.000 /usr/lib/python2.7/dist-packages/osgeo/ogr.py:3183(GetFieldDefn)
  2029152    0.652    0.000    1.325    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:35(_stringify)
    88222    0.933    0.000    1.114    0.000 /usr/lib/python2.7/csv.py:143(_dict_to_list)
  7149945    1.114    0.000    1.114    0.000 {isinstance}
  1764400    0.537    0.000    1.109    0.000 /usr/lib/python2.7/dist-packages/osgeo/ogr.py:3478(GetNameRef)
  1764400    0.440    0.000    0.988    0.000 /usr/lib/python2.7/dist-packages/osgeo/ogr.py:2552(IsFieldSet)
   972728    0.309    0.000    0.983    0.000 /usr/lib/python2.7/encodings/utf_8.py:15(decode)
  1764400    0.504    0.000    0.960    0.000 /usr/lib/python2.7/dist-packages/osgeo/ogr.py:2277(GetFieldCount)
    44110    0.715    0.000    0.920    0.000 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:592(<dictcomp>)
    88222    0.848    0.000    0.848    0.000 {method 'writerow' of '_csv.writer' objects}
  1764440    0.826    0.000    0.826    0.000 {_ogr.FeatureDefn_GetFieldDefn}
   972727    0.258    0.000    0.782    0.000 /usr/lib/python2.7/dist-packages/osgeo/ogr.py:2335(GetFieldAsString)
   972728    0.674    0.000    0.674    0.000 {_codecs.utf_8_decode}
    44111    0.026    0.000    0.636    0.000 /usr/lib/python2.7/dist-packages/osgeo/ogr.py:1190(GetNextFeature)

Profile conclusions

For a CSV source, we spend roughly half the time converting source CSV to extracted and half converting extracted to final output. That’s no surprise; the two CSV files are nearly identical. A whole lot of that time is spent deep in the bowels of Python’s CSV module reading rows. Again no surprise, but it confirms my suspicion that DictReader may be doing more work than we’d like.

For a shapefile source, we spend roughly 2/3 of the time using OGR to convert to CSV and 1/3 of the time converting the intermediate CSV to the final output. The OGR code is opaque, not clear how to figure out what it’s really spending time doing inside the C module.

Not clear what conclusions to draw here; there never is with profiling tools. I think my next step should be benchmarking Python’s CSV DictReader and seeing whether some simpler parsing would work significantly faster. I also think it’d be a huge improvement to remove that intermediate CSV file entirely; there’s a lot of overhead reading and writing it. The intermediate file makes the code way simpler to work with, but it should be possible to stream the rows in memory and retain most of the same code model.
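That streaming idea could look something like this, with generators standing in for the real extract and transform stages (the function names and row format here are simplified sketches, not the actual openaddr.conform API):

```python
def extract(source_rows):
    # Stand-in for extract_to_source_csv: normalize source rows to dicts
    for lon, lat, number, street in source_rows:
        yield {'LON': lon, 'LAT': lat, 'NUMBER': number, 'STREET': street}

def transform(rows):
    # Stand-in for transform_to_out_csv: apply the conform rules per row
    for row in rows:
        row['STREET'] = row['STREET'].title()
        yield row

# Rows flow through lazily, one at a time; no intermediate CSV on disk.
source = [('-122.27', '37.80', '100', 'MAIN ST')]
out = list(transform(extract(source)))
```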

Not clear any of this optimization is worth the effort.