Installing openaddresses-machine in a virtualenv

Trying to get a server up and running OpenAddresses in a controlled environment. I’m normally very sloppy and install everything as root in /usr, but I figured I should use virtualenv this time. Machine actually has a very nice Chef setup and bootstraps itself regularly in EC2 and Travis contexts, but I’m stubborn and doing it my own way.

Install openaddresses-machine and its simple dependencies

  • Prepare a virtualenv directory and activate it
  • pip install Openaddresses-Machine

Install cairocffi. This required a C library I didn’t have

  • sudo apt-get install libffi-dev
  • pip install cairocffi

Install Python GDAL. This is a mess; I’m not sure why the simple “pip install GDAL” doesn’t work from inside the virtualenv. And I’m not sure the instructions below are correct; it’s probably installing python-gdal globally on the system via apt and then again in the virtualenv via pip. But that gets all the C dependencies we need somewhere on the system. There’s extra rigamarole to get the bleeding edge GDAL instead of the stock Ubuntu GDAL. Also building GDAL requires a C++ compiler.

  • sudo apt-get install software-properties-common
  • sudo apt-add-repository ppa:ubuntugis/ubuntugis-unstable
  • sudo apt-get update
  • sudo apt-get install python-gdal libgdal-dev
  • sudo apt-get install g++
  • pip install GDAL
  • PSYCH! That won’t work. Follow the instructions in this gist for how to manually configure and install gdal. Apparently its packaging is not compatible with pip?

Logging out of Google

Logging out of Google is a funny process; you get redirected to a bunch of sites one at a time, presumably clearing cookies in different domains. youtube.com, blogger.com, and of course google.com. Here’s the URL for google.com, broken down by CGI parameters

http://www.google.com/accounts/Logout2?ilo=1
ils=adwords,ah,cl,doritos,lso,mail,orkut,sierra,writely,o.mail.google.com,o.myaccount.google.com,s.blogger,blogger
ilc=1
continue=https%3A%2F%2Faccounts.google.com%2FServiceLogin
zx=-910194353

I find that chain of ils parameters interesting. It’s an odd set of properties, including some like Orkut for which I have no account. I wonder if it’s a list of Google services that maintain their own authentication tokens?
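
For what it’s worth, Python 3’s urllib.parse pulls the query string apart nicely. A quick sketch, with the URL above reassembled using & separators:

# Split the logout URL's query string; the ils list is the interesting part.
from urllib.parse import urlsplit, parse_qs

url = ("http://www.google.com/accounts/Logout2?ilo=1"
       "&ils=adwords,ah,cl,doritos,lso,mail,orkut,sierra,writely,"
       "o.mail.google.com,o.myaccount.google.com,s.blogger,blogger"
       "&ilc=1"
       "&continue=https%3A%2F%2Faccounts.google.com%2FServiceLogin"
       "&zx=-910194353")

params = parse_qs(urlsplit(url).query)
print(params["ils"][0].split(","))   # the list of properties being logged out
print(params["continue"])            # ['https://accounts.google.com/ServiceLogin']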

I’m having a problem where all Google services are super-slow for me, randomly. Not just Gmail but even a Google search. Doesn’t happen in incognito mode. My theory was it was related to being logged in, but logging out didn’t fix it; only an incognito window does. Hrm. (Update: logging out does seem to have fixed it after all, and now that I’m logged in again I’m fine. Perhaps my Google session got stuck on a broken server?)

Some OpenAddresses machine stats

I’ve collected the state.txt files written by OpenAddresses machine and stuffed them in a database for ad hoc reporting. Here are some queries and results

Slowest processing

select source, max(address_count), round(avg(process_time)) as time
from stats where address_count > 0
group by source order by time desc limit 10;

source                          max(addres  time
------------------------------  ----------  ----------
nl.json                         14846597    8769.0
au-victoria.json                3426983     8438.0
es-25830.json                   8778892     6450.0
us-va.json                      3504160     4377.0
dk.json                         3450853     4220.0
pl-mazowieckie.json             995946      4108.0
us-ny.json                      4229542     4047.0
pl-lodzkie.json                 685093      3575.0
pl-slaskie.json                 575596      3331.0
pl-malopolski.json              670892      3216.0

Slowest cache

source                          max(addres  time
------------------------------  ----------  ----------
us-ny.json                      4229542     5808.0
za-nl-ethekwini.json            478939      3617.0
us-va-city_of_chesapeake.json   99197       3457.0
us-ca-marin_county.json         104067      3034.0
us-mi-ottawa.json               109202      2232.0
us-fl-lee.json                  548264      1999.0
us-ne-lancaster.json            97859       1995.0
us-ia-linn.json                 95928       1519.0
dk.json                         3450853     1348.0
us-az-mesa.json                 331816      890.0
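
I didn’t paste the query for this one; presumably it’s the processing query above with cache_time swapped in. Roughly this, run through Python’s sqlite3 (the database filename is a placeholder):

# Presumably the same query as above with cache_time in place of process_time.
import sqlite3

db = sqlite3.connect("stats.db")  # placeholder filename
query = """
    select source, max(address_count), round(avg(cache_time)) as time
    from stats where address_count > 0
    group by source order by time desc limit 10
"""
for row in db.execute(query):
    print(row)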

Summary of runs

For each run, print out how many sources we tried, how many cached, how many sampled, and how many processed. The counting here is a bit bogus but I think close. (The addresses column in the output below is presumably the total address count for the run; it isn’t produced by the query as shown.)

select ts, count(distinct(source)), count(distinct(cache_url)), count(distinct(sample_url)), count(distinct(processed_url))
from stats group by ts;
ts               addresses        source  caches  sample  proces
---------------  ---------------  ------  ------  ------  ------
1420182459                        737     689     557     490
1420528178                        738     691     559     490
1420787323                        740     692     560     495
1421133148                        789     717     558     521
1421392045                        790     718     559     523
1421737732                        790     719     557     521
1421996867                        790     719     558     522
1422342479                        790     718     560     524
1422601615                        790     652     582     503
1422645440                        790     655     585     511
1422748762                        790     650     580     504
1422773454                        790     647     577     506
1422860989                        790     653     582     508
1423034051                        790     658     588     509
1423206384                        790     658     587     508
1423465798                        790     660     588     510
1423638409                        790     659     588     508
1423811369                        790     661     590     508
1424070358                        790     657     586     509
1424243321                        790     645     566     499
1424416039                        790     650     534     469
1424675318        88031206        790     648     563     496
1424849197       111999281        794     658     570     505
1425022188       105056481        794     653     570     508
1425280104       109864743        794     651     567     507
1425453036       111734276        786     655     573     520
1425625770       116323932        786     656     571     519
1425881358       113713353        786     644     563     528
1426054556       117771563        788     666     586     534
1426228915       117335107        788     666     584     538
1427090950       113916633        788     665     581     527

Time to complete successful runs

 select ts, round(avg(cache_time)) as cache, round(avg(process_time)) as process 
 from stats where length(processed_url) > 1 
 group by ts order by ts;
ts               cache            proces
---------------  ---------------  ------
1420182459       63.0             154.0
1420528178       50.0             147.0
1420787323       46.0             166.0
1421133148       43.0             159.0
1421392045       43.0             160.0
1421737732       43.0             163.0
1421996867       45.0             167.0
1422342479       46.0             164.0
1422601615       48.0             228.0
1422645440       56.0             314.0
1422748762       47.0             300.0
1422773454       50.0             289.0
1422860989       61.0             296.0
1423034051       66.0             263.0
1423206384       44.0             260.0
1423465798       51.0             294.0
1423638409       51.0             270.0
1423811369       50.0             260.0
1424070358       55.0             277.0
1424243321       60.0             265.0
1424416039       57.0             298.0
1424675318       51.0             265.0
1424849197       62.0             323.0
1425022188       53.0             284.0
1425280104       59.0             312.0
1425453036       55.0             325.0
1425625770       79.0             367.0
1425881358       77.0             315.0
1426054556       84.0             335.0
1426228915       76.0             361.0
1427090950       71.0             272.0

Most improved

A look at which sources have the most variance in how many address lines we got out. Kinda bogus, but fun.

select source, max(address_count)-min(address_count) as diff, min(address_count) as minc, max(address_count) as maxc
from stats where address_count > 0 
group by source order by diff desc limit 15;

source                          diff        minc        maxc
------------------------------  ----------  ----------  ----------
us-mn-dakota.json               138301      27218       165519
us-pa-allegheny.json            53399       474492      527891
us-ne-lancaster.json            44996       52863       97859
us-va-salem.json                9992        614         10606
dk.json                         5258        3445595     3450853
be-flanders.json                2809        3365557     3368366
us-nc-mecklenburg.json          2117        499524      501641
us-co-larimer.json              1118        169500      170618
ca-on-hamilton.json             1112        243554      244666
us-va-james_city.json           1071        33383       34454
us-tx-austin.json               837         374854      375691
us-or-portland.json             786         415220      416006
us-nc-buncombe.json             784         147492      148276
us-nc-union.json                720         81241       81961
us-oh-williams.json             603         18496       19099

Bandwidth caps

My fast cable modem service comes with a data cap. I get an honest 100 Mbit/s, with a 300 Gbyte/month cap. Beyond that it costs $0.20 per gigabyte. I hate the idea of data caps although I sort of understand where the ISP is coming from. If I hit this cap every month I’d be angry, but so far I’ve usually come in far under. Not this month though; between the new PS4 and downloading a bunch of video I’m well over. 360GB and there’s 10 days to go.

Being over the cap has really changed how I think about the Internet. I’m now pricing stuff in my head. Watching a single LoL game broadcast will cost me $0.20. Watching a high quality movie on a stream will cost me $1.00. Do I really need to download a whole season of that TV show I might or might not watch? Maybe I can wait a few days, until April? Maybe I should put off some software updates too. I mean truthfully it’s not a lot of money, and I can certainly afford it, but knowing that I’m paying by the GB changes the way I think about it. (Related: this is why micropayments are a bad idea.)

I’m a bit confused about how my ISP calculates my usage. I also track usage via my router, and the GB usage numbers are wildly divergent.

Month    ISP (GB)    Router (GB)    ISP/Router
Jan      167         199              84%
Feb      160         148             108%
Mar      360         235             153%

I certainly understand why the numbers wouldn’t be exactly the same, but it’s been ±15% for the last two months. This month it’s +50%. I may ask about that when the month is over, although I’m pessimistic that customer service will give me a useful answer.

If I maxed my connection I could download about 30,000 GB a month; so my cap is 1% of max. My slow rural ISP has 1/100th the speed but the cap is only 1/10th the size, so relative to that line’s own max the cap works out to about 10%.
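
A quick back-of-envelope check of those ratios, assuming a 30-day month and decimal gigabytes:

# Cap as a fraction of what the line could move in an ideal 30-day month.
SECONDS_PER_MONTH = 30 * 24 * 3600

def ideal_gb_per_month(mbit_per_s):
    return mbit_per_s / 8.0 / 1000.0 * SECONDS_PER_MONTH  # Mbit/s -> GB/month

print(300 / ideal_gb_per_month(100))  # cable: 300 GB of ~32,400 GB, about 1%
print(30 / ideal_gb_per_month(1))     # rural: 30 GB of ~324 GB, about 10%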

NTP kibbitzing

This Metafilter discussion about the ntpd developer needing funding got me thinking more about NTP. I wrote a few comments on MeFi that have good content.

But the big revelation for me is NTimed, PHK’s project to write an ntpd replacement. His notes are very encouraging. The big excitement is that he’s explicitly building a client-only version of the code for 99.99% of users, and he thinks he can keep it to < 10,000 lines of code. That would be fantastic, both for security and for performance. I hope it’s a success. So far it looks very promising.

unicodecsv 0.11.0 speed improvement

My little Python csv parsing benchmark had one really nice effect: someone contributed a patch to remove some unnecessary calls to isinstance in certain circumstances. He claims a 2x speedup. That patch was just released in 0.11.0, so I tested it. These are the same benchmarks and data file I ran and reported before as “Python 2 results”.

 61.65s unicodeCsvReader 0.9
118.64s unicodeCsvDictReader 0.9
 41.12s unicodeCsvReader 0.11
 96.34s unicodeCsvDictReader 0.11

Not quite a 2x speedup, but it runs in 0.7x to 0.8x the time it used to. That’s quite good for a simple change, and a nice example of open source working well.
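
A minimal sketch of the kind of benchmark this is, timing a full pass over the file with each reader (not the exact script or data file; input.csv is a placeholder):

# Time a full pass over a CSV file with each unicodecsv reader.
import time
import unicodecsv

def time_pass(make_reader, path):
    with open(path, 'rb') as f:            # unicodecsv reads bytes
        start = time.time()
        for row in make_reader(f):
            pass                           # iterating is the work being measured
    return time.time() - start

print(time_pass(lambda f: unicodecsv.reader(f, encoding='utf-8'), 'input.csv'))
print(time_pass(lambda f: unicodecsv.DictReader(f, encoding='utf-8'), 'input.csv'))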

I should really stop using the csv DictReader. Or maybe make a new one that’s smarter. The current Python module seems to actually create a real dict object for every row. I think you could make something faster that used a class wrapper for the tuple that emulated dict-style retrieval by looking up column offsets in the row header. At least there’d be less data copied, but maybe the wrapper overhead would negate that. Have to try it to see.
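
Roughly what I have in mind, as an untested sketch: keep one shared header-to-offset dict, leave each row as the list the reader gives back, and only do dict-style lookup on demand.

# Untested sketch: rows stay as plain lists, a thin wrapper resolves
# column names through one shared header->offset dict.
import unicodecsv

class LazyRow(object):
    __slots__ = ('_row', '_index')

    def __init__(self, row, index):
        self._row = row      # the list the csv reader produced
        self._index = index  # shared {column name: offset} dict

    def __getitem__(self, key):
        return self._row[self._index[key]]

    def get(self, key, default=None):
        i = self._index.get(key)
        return self._row[i] if i is not None else default

def lazy_dict_reader(f, **kwargs):
    reader = unicodecsv.reader(f, **kwargs)
    header = next(reader)
    index = {name: i for i, name in enumerate(header)}
    for row in reader:
        yield LazyRow(row, index)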

Modern consoles vs bandwidth

I just got a Playstation 4. Sony seems to have dropped any rules related to game update download sizes. Newish games I bought on disc like Shadow of Mordor or Dragon Age: Inquisition require downloading 8GB “game updates” before they’re playable. Sometimes on the first day the game comes out! And multiplayer titles like Destiny are even worse: 16GB to download the game, then another 8GB + 16GB of downloads if I recall correctly.

I started setting up this PS4 and four hours later it had downloaded 80GB. That’s to install 2 games I bought on disc, 3 games from download, and 2 downloaded demos. I’m just grateful I had the foresight to set this up before hauling the console up to Grass Valley and the 1Mbps Internet there. An 8GB download will literally take a day there. Also I have a 30GB cap, so let’s hope updates aren’t frequent.
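
For the record, the back-of-envelope math on that 1 Mbps line:

# 8 GB at an ideal 1 Mbit/s; real DSL throughput runs lower, so call it a day.
bits = 8 * 1000**3 * 8          # 8 decimal GB in bits
hours = bits / 1e6 / 3600       # at 1 Mbit/s
print(hours)                    # ~17.8 hours before any overhead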