Second full OpenAddress run

After the first run against 600+ sources we patched up a bunch of bugs and improved the running infrastructure. Then I ran another set yesterday, amended a little bit today to re-run jp-* and kr-* after one more bug fix. I wasn’t quite as careful in measuring everything, but here are some notes on output.

  • 579 input sources: 484 out.csv files, 483 sample.json files
    84% success rate; compare 66% from previous run.
  • 102M output rows.
    Compare 80M from previous run, 100M from Node
  • Roughly 3.5 hours running time, maybe less.
    3 threads stalled on bad servers so it’s not a good measure.
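The success rate above is just out.csv files over input sources. A minimal sketch of that arithmetic, using the counts from this run (the function name is my own, not part of any OpenAddresses tooling):

```python
def success_rate(succeeded, total):
    """Percentage of sources that produced an out.csv."""
    return 100.0 * succeeded / total

# Counts from this run: 484 out.csv files from 579 input sources.
rate = success_rate(484, 579)
print(f"{rate:.0f}% success rate")  # prints "84% success rate"
```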

Throwing out the 3 stalled runs, the average source took 137s to execute. Standard deviation of 585s, so it’s a very broad distribution. That first run averaged 104s (SD 265), but a lot more of those runs failed in 1 second, which pulls its average down.
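For reference, here is roughly how a summary like this can be computed. It is only a sketch: the timings dict and source names are made up for illustration, and I am assuming a sample standard deviation (the numbers above could equally have come from a population one).

```python
import statistics

def summarize(times, stalled=()):
    """Mean and sample standard deviation of per-source run times,
    skipping any sources listed as stalled."""
    kept = [t for name, t in times.items() if name not in stalled]
    return statistics.mean(kept), statistics.stdev(kept)

# Hypothetical timings in seconds, not real data from this run.
times = {"src-a": 1.0, "src-b": 3.0, "src-c": 5.0, "src-stalled": 10000.0}
mean, sd = summarize(times, stalled={"src-stalled"})
print(f"mean {mean:.0f}s, SD {sd:.0f}s")
```

Leaving the stalled source in would swamp both numbers, which is why the stalled runs get thrown out before averaging.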

I’m curious if the new ESRI code is faster or slower than the old code. It’s definitely better in that it works with many more sources. Don’t really have data for it. The ESRI time is probably dominated by slow servers anyway.

Here are the slowest non-ESRI sources. Given that these times include download times from some slow servers I’m pretty OK with them.

dk            CSV   2000s     3.4M rows
au-victoria   SHP   1620s     3.4M rows
es-25830      CSV   1173s     8.8M rows
nl            CSV   1156s    14.8M rows
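One way to read that table is as rows per second. A quick sketch using the figures above (I am assuming the nl time is in seconds like the others; since the times include downloads, these are lower bounds on actual processing speed):

```python
def rows_per_second(rows, seconds):
    """Throughput for one source, download time included."""
    return rows / seconds

# (source, rows, seconds) taken from the table above.
slowest = [
    ("dk", 3.4e6, 2000),
    ("au-victoria", 3.4e6, 1620),
    ("es-25830", 8.8e6, 1173),
    ("nl", 14.8e6, 1156),
]

for name, rows, seconds in slowest:
    print(f"{name:<12} {rows_per_second(rows, seconds):>8.0f} rows/s")
```

By this measure nl is the fastest of the four even though it is among the slowest in wall-clock time; it just has a lot of rows.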

95 sources failed to produce an out.csv. Of those, 7 did succeed last time we ran, a potential regression. Here’s the cause of each failure:

  • us-il-tazewell: ESRI source, JSON parsing error after many calls
  • us-tx-colleyville, us-tx-dallas, us-tx-hurst: ESRI sources, found no records
  • us-nc: missing “file” attribute for multiple shapefiles
  • us-dc: bad download
  • us-ca-san_diego: bad zip download
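Finding regressions like these is a set difference: sources that produced an out.csv in the previous run but not in this one. A minimal sketch, assuming each run can be reduced to a set of source names that succeeded (the function name and the tiny example sets are illustrative, not the real run lists):

```python
def find_regressions(prev_ok, curr_ok):
    """Sources that succeeded in the previous run but not in the current one."""
    return sorted(set(prev_ok) - set(curr_ok))

# Toy inputs for illustration only; the real sets would come from
# listing which sources have an out.csv in each run's output.
prev_ok = {"us-dc", "us-nc", "dk"}
curr_ok = {"dk", "nl"}
print(find_regressions(prev_ok, curr_ok))  # prints ['us-dc', 'us-nc']
```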

And finally here’s the list of all 95 sources that didn’t produce an out.csv. I should go through and hand-classify the failures again, but I bet most of the other 88 are failing for the same reason as last time.

au-queensland
be-flanders
ca-ab-calgary
ca-ab-strathcona-county
ca-bc-kelowna
ca-bc-langley
ca-bc-nanaimo
ca-bc-okanagan_similkameen
ca-bc-surrey
ca-bc-west_kelowna
ca-ns-halifax
ca-on-niagra
ca-sk-regina
us-al-shelby
us-ar
us-ca-san_diego
us-co-sanmiguel
us-ct-avon
us-dc
us-fl-alachua
us-fl-collier
us-ga-glynn
us-ia-linn
us-ia-polk
us-id-canyon
us-il-tazewell
us-in-hamilton
us-in-madison
us-in-st_joseph
us-la-st_james
us-ma
us-mi-muskegon
us-mn-dakota
us-mn-metrogis
us-mn-polk
us-mn-pope
us-mn-yellow_medicine
us-mo-barry
us-mo-st_louis_county
us-ms-madison
us-nc
us-nc-1
us-nc-10
us-nc-2
us-nc-3
us-nc-4
us-nc-5
us-nc-6
us-nc-7
us-nc-8
us-nc-9
us-nc-charlotte
us-nc-columbus
us-nc-davie
us-nc-wake_county
us-ne-omaha
us-nm-san_juan
us-nv-henderson
us-nv-lander
us-nv-nye
us-nv-washoe_county
us-oh-clinton
us-oh-hamilton
us-pa-beaver
us-ri
us-tn-memphis
us-tx-colleyville
us-tx-dallas
us-tx-denton
us-tx-el_paso
us-tx-hurst
us-tx-keller
us-tx-north_richland_hills
us-tx-round_rock
us-va-alexandria
us-va-augusta
us-va-city_of_falls_church
us-va-city_of_petersburg
us-va-richmond_city
us-va-roanoke
us-va-stafford
us-wa-san_juan
us-wa-snohmish
us-wi-adams
us-wi-calumet
us-wi-crawford
us-wi-dodge
us-wi-jefferson
us-wi-juneau
us-wi-lincoln
us-wi-oneida
us-wi-richland
us-wi-sauk
us-wi-superior
us-wy-laramie