Installing openaddress-machine on a new EC2 system using Chef

No need to install stuff manually; Mike already wrapped up scripts to set up an EC2 system with Chef for us. Here’s how to use it on a brand-new EC2 micro server:

sudo bash
apt-get update
apt-get upgrade
apt-get install git
git clone https://github.com/openaddresses/machine.git
cd machine/chef
./run.sh

Done! The shell command openaddr-process-one now works and does stuff.

In brief, this:

  1. installs Chef and Ruby via apt
  2. runs a Python setup recipe. That installs a few Ubuntu Python packages with apt (including GDAL and Cairo), then does a “pip install” in the OpenAddress machine directory. This tells pip to install a bunch of other Python stuff we use.
  3. runs a recipe for OpenAddresses. This uses git to put the source JSON data files in /var/opt.

Update

But really, that’s so manual. If you just pip install Openaddresses-Machine, it installs a /usr/local/bin/openaddr-ec2-run script that will do the work for you. That in turn invokes a run.py script which you run on your local machine; among other things, it runs a templated shell script to set up an EC2 instance and run the job on it.

The shell script that is run on EC2 is pretty basic. It:

  1. Updates apt (but does not upgrade)
  2. Installs git and apache2
  3. Clones the openaddress-machine repo into /tmp/machine
  4. Runs scripts to set up swap on the machine, then invokes Chef to set up the machine
  5. Runs openaddr-process to do the job
  6. Shuts the machine down.

The run.py script you use on your own machine is mostly about getting an EC2 instance. It:

  1. De-templates the shell script and puts it in user_data.
  2. Uses boto.ec2 to bid on a spot instance.
  3. Waits for up to 12 hours until we get our instance.

The details of how the EC2 instance is bid for, created, and waited on are a bit funky but seem well contained.
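
Roughly, the boto.ec2 part looks like the sketch below. This is not the actual run.py code, just an outline of boto’s spot-request calls; the region, AMI ID, bid price, key name, instance type, and file name are placeholders.

# Sketch only: roughly what run.py does with boto.ec2, not the real script.
# Region, AMI ID, price, key name, instance type, and file name are placeholders.
import time
import boto.ec2

conn = boto.ec2.connect_to_region('us-east-1')

# The de-templated shell script becomes the instance's user_data
user_data = open('ec2-setup.sh').read()

# Bid on a single spot instance
requests = conn.request_spot_instances(
    price='0.10', image_id='ami-xxxxxxxx', count=1,
    instance_type='m3.xlarge', key_name='oa-keypair',
    user_data=user_data)
request_id = requests[0].id

# Poll until the request is fulfilled, giving up after 12 hours
deadline = time.time() + 12 * 3600
instance_id = None
while time.time() < deadline:
    request = conn.get_all_spot_instance_requests([request_id])[0]
    if request.state == 'active' and request.instance_id:
        instance_id = request.instance_id
        break
    time.sleep(60)

print('got instance', instance_id)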

Installing openaddresses-machine on a new EC2 instance

Just documenting some work I did. The pattern here follows my notes on making openaddresses work in virtualenv, although I didn’t use venv here. This is really how to get the minimal stuff installed on a new Ubuntu box to run our code. Just a bunch of shell commands.

Set up an Ubuntu 14.04 AMI. The smallest one will run it OK

Do the following as root:

# update Ubuntu
apt-get update
apt-get upgrade

# install pip
cd /tmp
wget https://bootstrap.pypa.io/get-pip.py
python get-pip.py

# install gcc and python dev
apt-get install gcc g++ python-dev

# install openaddresses
pip install Openaddresses-Machine

# install Cairo
apt-get install libffi-dev libcairo2
pip install cairocffi

# install GDAL
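# (on a fresh box, apt-add-repository may first need: apt-get install software-properties-common)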
apt-add-repository ppa:ubuntugis/ubuntugis-unstable
apt-get update
apt-get install python-gdal libgdal-dev
pip install GDAL

I haven’t run these commands a second time yet, but it should be close. The last line for “pip install GDAL” is probably not necessary, and it’d probably be better to install Openaddresses-Machine last although it may not matter.
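
One quick way to check the installs landed (my addition, not part of the original commands): try the imports that break when the C libraries are missing.

# Quick sanity check, not part of the original notes: these are the imports
# that fail if libffi/cairo or the GDAL development packages didn't install.
import cairocffi            # needs libffi-dev and libcairo2
from osgeo import ogr       # needs libgdal-dev / python-gdal
import openaddr             # installed by pip install Openaddresses-Machine

print('OGR drivers available:', ogr.GetDriverCount())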

Next step: automate this with Chef Solo.

After that: set up Honcho and a Procfile to run my queue service worker.

OpenAddress machine continuous queue

Migurski and I have been working up a new architecture for OpenAddresses, specifically the python code that runs source jobs. Currently everything runs as a big batch job every two days. The primary goal is to get a system that can run a new submitted source instantly, ideally with useful feedback to the submitter.

Mike’s insight was to build a system a little like Travis; something that looks at GitHub commits with a hook, runs a new job every time something is committed, and posts back info via the GitHub status API. GitHub is nice because it provides a file store and change tracking. Ultimately we don’t want to require a GitHub login to submit sources, and we probably want a custom UI, but that can come later.

Here are the components we’re talking about building:

  1. A GitHub webhook that submits info to our new job server when stuff is checked in.
  2. A new job server that accepts HTTP posts of GitHub events. Mike’s got one working in Flask running on Heroku.
  3. A job queue. New jobs get enqueued by the new job server, then some worker daemon process checks for jobs to run and executes them. Can be a very simple queue, but needs to support job priorities (interactive vs. batch). Right now I’m thinking of writing my own using Postgres running in Amazon RDS, but there’s got to be something I can just use that’s not super overkill. (Update: pq looks good; a sketch of how these pieces could fit together follows this list.)
  4. Workers to run jobs. Something to look for new jobs and execute them. Current thought is a persistent Unix server running on EC2. Installs openaddr and its dependencies via pip, runs jobs via openaddr-process-one, then posts results back to the job queue’s results queue.
  5. Job completion notifications back to GitHub using the GitHub status API. Mike has been tinkering with code to do this already.
  6. A batch-mode refresher. Something to run sources periodically even if no one’s edited the source specification in awhile. Mostly to catch new data on the remote server.
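
To make the shape of this concrete, here’s a rough sketch of how several of these pieces could hang together using Flask, pq, and the GitHub status API. None of this is the real code: the queue names, database DSN, repo name, and the exact arguments to openaddr-process-one are assumptions, and the pq calls just follow that library’s documented put/get interface.

# Sketch of the webhook -> queue -> worker flow. Not the real implementation;
# queue names, DSN, repo, and the openaddr-process-one invocation are assumed.
import subprocess, time
import requests
from flask import Flask, request
from psycopg2 import connect
from pq import PQ

conn = connect('dbname=openaddr')   # Postgres, e.g. on Amazon RDS
pq = PQ(conn)
# pq.create()                       # one-time setup of the queue table

interactive = pq['interactive']     # jobs triggered by GitHub events
batch = pq['batch']                 # periodic re-runs of old sources

app = Flask(__name__)

@app.route('/hook', methods=['POST'])
def github_hook():
    # Accept a GitHub push event, enqueue a job for each changed source file
    event = request.get_json()
    for commit in event.get('commits', []):
        for path in commit.get('added', []) + commit.get('modified', []):
            if path.endswith('.json'):
                interactive.put({'source': path, 'sha': event['after']})
    return 'queued', 200

def set_status(sha, state, description, token):
    # Post a commit status back to GitHub: pending, success, failure, or error
    url = 'https://api.github.com/repos/openaddresses/openaddresses/statuses/' + sha
    requests.post(url, headers={'Authorization': 'token ' + token},
                  json={'state': state, 'description': description,
                        'context': 'openaddresses/machine'})

def worker(token):
    # Worker loop: prefer interactive jobs, fall back to batch jobs
    while True:
        job = interactive.get() or batch.get()
        if job is None:
            time.sleep(10)
            continue
        source, sha = job.data['source'], job.data['sha']
        set_status(sha, 'pending', 'running ' + source, token)
        # Exact command-line arguments here are a guess
        ok = subprocess.call(['openaddr-process-one', source, '/tmp/out']) == 0
        set_status(sha, 'success' if ok else 'failure', 'finished ' + source, token)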

Here are some open things to design.

  1. Job status reporting with a humane UI. Currently we get very little info back when a job is run: a few bits of status (did it complete, how long it took), a debug log from the process, and any actual output files. We need more. In particular I think the stuff you currently learn from the debug log needs to be expanded into useful user-level event data, i.e. events like “source JSON parsed”, “source data downloaded”, etc. (a tiny sketch follows this list).
  2. UI for creating source specification JSON files. Imagining some guided, wizard-like UI using Javascript and forms. There’s already a pretty good one on the live site; it just needs some expansion. Also, ultimately we shouldn’t require a GitHub login.
  3. Security. I’m a bit nervous about running completely unreviewed source specs from anonymous people. Isolating the worker in a throwaway machine might help, or doing some sanitization on inputs.
  4. Statistics reporting. I have a dashboard that works OK for the existing nightly batch mode, but we’ll want something different for this continuous mode thing.
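
For the first item, I’m imagining the worker emitting structured progress events instead of just a debug log. A tiny sketch, with invented field names and example details:

# Illustration only: one possible shape for user-level progress events.
import json, time

def emit(events, stage, ok=True, detail=None):
    # Record one user-visible step of a source run
    events.append({'stage': stage, 'ok': ok, 'detail': detail, 'time': time.time()})

events = []
emit(events, 'source JSON parsed')
emit(events, 'source data downloaded')
emit(events, 'addresses extracted', ok=False, detail='projection not recognized')
print(json.dumps(events, indent=2))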

See also OpenAddresses Mark 3, a proposal I sketched in January. What we’re doing here is similar but without the component of breaking any individual source run up into separate tasks.

Python3 est arrivé

I’m now using Python3 by default for all new projects. Almost entirely because I prefer sane Unicode handling. But really because Py3 has turned the corner and is now useful for most things. Probably it turned that corner a year or more ago, but it was the OpenAddresses work that made me start using Python 3 regularly. It helps that Homebrew and Debian both make it very easy to install ‘python3’ and ‘pip3’. And of course that most packages I care about are now ported over.

The Py3 transition took five years longer than I expected to get to this point. Let’s hope we don’t have to do it again.

Address schema

Some stuff of possible interest to OpenAddresses: the US government’s Federal Geographic Data Committee has a standardized street address schema. The docs are 140 pages of text describing an XML schema, which sure seems wordy, but then it’s also very precise. They even put the XML schema online; here’s a text rendering.

It’s super wordy and contains a lot of detail, but at the core is a schema we could repurpose that says things like “a street address consists of a number, a prefix, a street name, …”

There’s a good summary of the schema on page 3 of the executive summary. Here’s the meat of it:

[Screenshot: summary of the FGDC address schema]
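
As a rough idea of what repurposing that core could look like, here’s the sort of record it implies. The field names are my paraphrase, not the schema’s actual XML element names:

# Rough paraphrase of the core "a street address consists of..." idea.
# These are illustrative field names, not the FGDC schema's element names.
from collections import namedtuple

Address = namedtuple('Address', [
    'address_number',      # 123
    'street_name_prefix',  # N
    'street_name',         # Main
    'street_name_type',    # St
    'unit',                # Apt 4
    'place_name',          # city or town
    'state',
    'zip_code',
])

addr = Address('123', 'N', 'Main', 'St', '', 'Springfield', 'IL', '62701')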

Installing openaddresses-machine in a virtualenv

Trying to get a server up running OpenAddresses in a controlled environment. I’m normally very sloppy and install everything as root in /usr, but I figured I should use virtualenv this time. Machine actually has a very nice Chef setup and bootstraps itself regularly in EC2 and Travis contexts, but I’m stubborn and doing it my own way.

Install openaddresses-machine and its simple dependencies

  • Prepare a virtualenv directory and activate it
  • pip install Openaddresses-Machine

Install cairocffi. This required a C library I didn’t have

  • sudo apt-get install libffi-dev
  • pip install cairocffi

Install Python GDAL. This is a mess; I’m not sure why the simple “pip install GDAL” doesn’t work from inside the virtualenv. And I’m not sure the instructions below are correct; it’s probably installing python-gdal globally on the system via apt and then again in the virtualenv via pip. But that gets all the C dependencies we need somewhere on the system. There’s extra rigamarole to get the bleeding edge GDAL instead of the stock Ubuntu GDAL. Also, building GDAL requires a C++ compiler.

  • apt-get install software-properties-common
  • sudo apt-add-repository ppa:ubuntugis/ubuntugis-unstable
  • sudo apt-get update
  • sudo apt-get install python-gdal libgdal-dev
  • sudo apt-get install g++
  • pip install GDAL
  • PSYCH! That won’t work. Follow the instructions in this gist for how to manually configure and install gdal. Apparently its packaging is not compatible with pip?

Some OpenAddresses machine stats

I’ve collected the state.txt files written by OpenAddress machine and stuffed them in a database for ad hoc reporting. Here are some queries and results.

Slowest processing

select source, max(address_count), round(avg(process_time)) as time
from stats where address_count > 0
group by source order by time desc limit 10;

source                          addresses   time
------------------------------  ----------  ----------
nl.json                         14846597    8769.0
au-victoria.json                3426983     8438.0
es-25830.json                   8778892     6450.0
us-va.json                      3504160     4377.0
dk.json                         3450853     4220.0
pl-mazowieckie.json             995946      4108.0
us-ny.json                      4229542     4047.0
pl-lodzkie.json                 685093      3575.0
pl-slaskie.json                 575596      3331.0
pl-malopolski.json              670892      3216.0

Slowest cache

source                          addresses   time
------------------------------  ----------  ----------
us-ny.json                      4229542     5808.0
za-nl-ethekwini.json            478939      3617.0
us-va-city_of_chesapeake.json   99197       3457.0
us-ca-marin_county.json         104067      3034.0
us-mi-ottawa.json               109202      2232.0
us-fl-lee.json                  548264      1999.0
us-ne-lancaster.json            97859       1995.0
us-ia-linn.json                 95928       1519.0
dk.json                         3450853     1348.0
us-az-mesa.json                 331816      890.0

Summary of runs

For each run, print out how many sources we tried, how many cached, how many sampled, and how many processed. The counting here is a bit bogus but I think close.

select ts, count(distinct(source)), count(distinct(cache_url)), count(distinct(sample_url)), count(distinct(processed_url))
from stats group by ts;
ts               addresses        source  caches  sample  proces
---------------  ---------------  ------  ------  ------  ------
1420182459                        737     689     557     490
1420528178                        738     691     559     490
1420787323                        740     692     560     495
1421133148                        789     717     558     521
1421392045                        790     718     559     523
1421737732                        790     719     557     521
1421996867                        790     719     558     522
1422342479                        790     718     560     524
1422601615                        790     652     582     503
1422645440                        790     655     585     511
1422748762                        790     650     580     504
1422773454                        790     647     577     506
1422860989                        790     653     582     508
1423034051                        790     658     588     509
1423206384                        790     658     587     508
1423465798                        790     660     588     510
1423638409                        790     659     588     508
1423811369                        790     661     590     508
1424070358                        790     657     586     509
1424243321                        790     645     566     499
1424416039                        790     650     534     469
1424675318        88031206        790     648     563     496
1424849197       111999281        794     658     570     505
1425022188       105056481        794     653     570     508
1425280104       109864743        794     651     567     507
1425453036       111734276        786     655     573     520
1425625770       116323932        786     656     571     519
1425881358       113713353        786     644     563     528
1426054556       117771563        788     666     586     534
1426228915       117335107        788     666     584     538
1427090950       113916633        788     665     581     527

Time to complete successful runs

 select ts, round(avg(cache_time)) as cache, round(avg(process_time)) as process 
 from stats where length(processed_url) > 1 
 group by ts order by ts;
ts               cache            proces
---------------  ---------------  ------
1420182459       63.0             154.0
1420528178       50.0             147.0
1420787323       46.0             166.0
1421133148       43.0             159.0
1421392045       43.0             160.0
1421737732       43.0             163.0
1421996867       45.0             167.0
1422342479       46.0             164.0
1422601615       48.0             228.0
1422645440       56.0             314.0
1422748762       47.0             300.0
1422773454       50.0             289.0
1422860989       61.0             296.0
1423034051       66.0             263.0
1423206384       44.0             260.0
1423465798       51.0             294.0
1423638409       51.0             270.0
1423811369       50.0             260.0
1424070358       55.0             277.0
1424243321       60.0             265.0
1424416039       57.0             298.0
1424675318       51.0             265.0
1424849197       62.0             323.0
1425022188       53.0             284.0
1425280104       59.0             312.0
1425453036       55.0             325.0
1425625770       79.0             367.0
1425881358       77.0             315.0
1426054556       84.0             335.0
1426228915       76.0             361.0
1427090950       71.0             272.0

Most improved

A look at which sources have the most variance in how many address lines we got out. Kinda bogus, but fun.

select source, max(address_count)-min(address_count) as diff, min(address_count) as minc, max(address_count) as maxc
from stats where address_count > 0 
group by source order by diff desc limit 15;

source                          diff        minc        maxc
------------------------------  ----------  ----------  ----------
us-mn-dakota.json               138301      27218       165519
us-pa-allegheny.json            53399       474492      527891
us-ne-lancaster.json            44996       52863       97859
us-va-salem.json                9992        614         10606
dk.json                         5258        3445595     3450853
be-flanders.json                2809        3365557     3368366
us-nc-mecklenburg.json          2117        499524      501641
us-co-larimer.json              1118        169500      170618
ca-on-hamilton.json             1112        243554      244666
us-va-james_city.json           1071        33383       34454
us-tx-austin.json               837         374854      375691
us-or-portland.json             786         415220      416006
us-nc-buncombe.json             784         147492      148276
us-nc-union.json                720         81241       81961
us-oh-williams.json             603         18496       19099