Some rough notes on optimizing openaddresses conform. These times are from a January 23 run of the Python code. The machine was busy running 8x jobs, so the times may be somewhat inflated, but it’s a start.
Here are three sources I know to be slow because of our own code. The times reported here are purely time spent in conform, after the source was downloaded.
- nl (csv source): 35 minutes, 14.8M rows
- dk (csv source): 43 minutes, 3.4M rows
- au-victoria (shapefile source): 46 minutes, 3.4M rows
- ??? ESRI source. No good examples; most of my conform code treats ESRI sources as effectively CSV anyway, so I’m going to ignore them for now.
I just ran nl again and it took 31.5 minutes (26 minutes user, 5 minutes sys). Close enough; I’ll take these January 23 times as still indicative, at least for CSV sources.
Here’s some fast / short sources I can use for testing. These are total times including network.
- us-ca-palo_alto.json (csv source) 26 seconds
- ca-bc-north_cowichan.json (csv source) 24 seconds
- us-wa-chelan.json (shapefile source) 33 seconds
And here’s a report of top slow jobs that didn’t actually time out. Some of this slowness is due to network download time.
3521s us-va-city_of_chesapeake.json
2807s au-victoria.json
2765s us-ca-marin_county.json
2589s dk.json
2116s nl.json
2032s us-sc-aiken.json
1660s us-va-new_kent.json
1639s es-25830.json
1541s us-nc-alexander.json
1498s us-va.json
1367s us-va-fairfax.json
1352s us-sd.json
1345s us-ca-los_angeles_county.json
1325s us-mn-ramsey.json
1216s us-al-calhoun.json
1015s us-mi-kent.json
973s us-ms-hinds.json
955s us-wa-skagit.json
937s us-tn-rutherford.json
918s us-ca-solano_county.json
918s us-nc.json
786s us-fl-palm_beach_county.json
783s us-wa-seattle.json
776s us-wa-king.json
769s be-flanders.json
762s us-sc-laurens.json
729s us-wy-natrona.json
691s us-il-mchenry.json
682s us-tx-houston.json
678s us-al-montgomery.json
656s pl.json
grep Finished *.log | sort -nr -k 9 | cut -c 65- | sed 's%for /var/opt/openaddresses/sources/%%' | head -50
Here’s some quick cProfile output, sorted by cumulative time.
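For reference, a minimal recipe for producing this kind of output with the standard library's cProfile and pstats, sorted by cumulative time. The work() function here is just a stand-in for whatever is being profiled (e.g. conform_cli):

```python
import cProfile
import io
import pstats

def work():
    # stand-in for the real entry point being profiled
    return sum(i * i for i in range(100000))

pr = cProfile.Profile()
pr.enable()
work()
pr.disable()

# sort by cumulative time and keep the top 10 rows, like the tables below
buf = io.StringIO()
pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats(10)
report = buf.getvalue()
print(report)
```

Running a whole module under the profiler (`python -m cProfile ...`) works too, but the API form makes it easy to profile just the conform step and skip startup cost.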
ca-bc-north_cowichan
python -m openaddr.conform ~/src/oa/profile/sources-fast/ca-bc-north_cowichan.json ~/src/oa/profile/caches/cowichan.csv /tmp/o/foo
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 6.053 6.053 <string>:1(<module>)
1 0.000 0.000 6.053 6.053 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:723(main)
1 0.000 0.000 6.049 6.049 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:700(conform_cli)
1 0.000 0.000 3.628 3.628 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:652(extract_to_source_csv)
1 0.042 0.042 3.628 3.628 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:433(csv_source_to_csv)
30260 0.170 0.000 3.241 0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:187(next)
30260 0.163 0.000 2.958 0.000 /usr/lib/python2.7/csv.py:104(next)
30262 2.639 0.000 2.715 0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:105(next)
1 0.042 0.042 2.421 2.421 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:678(transform_to_out_csv)
15129 0.066 0.000 1.328 0.000 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:510(row_extract_and_reproject)
559802/15130 0.636 0.000 1.256 0.000 /usr/lib/python2.7/copy.py:145(deepcopy)
15132/15130 0.147 0.000 1.219 0.000 /usr/lib/python2.7/copy.py:253(_deepcopy_dict)
30260 0.028 0.000 0.822 0.000 /usr/lib/python2.7/csv.py:151(writerow)
30260 0.033 0.000 0.599 0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:82(writerow)
15129 0.026 0.000 0.571 0.000 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:558(row_transform_and_convert)
30262 0.085 0.000 0.402 0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:46(_stringify_list)
332882 0.147 0.000 0.312 0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:35(_stringify)
509313 0.236 0.000 0.309 0.000 /usr/lib/python2.7/copy.py:267(_keep_alive)
15129 0.025 0.000 0.256 0.000 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:590(row_smash_case)
30260 0.170 0.000 0.196 0.000 /usr/lib/python2.7/csv.py:143(_dict_to_list)
us-wa-chelan
python -m openaddr.conform ~/src/oa/profile/sources-fast/us-wa-chelan.json ~/src/oa/profile/caches/chelan/*shp /tmp/o/foo
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 29.549 29.549 <string>:1(<module>)
1 0.000 0.000 29.549 29.549 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:723(main)
1 0.000 0.000 29.545 29.545 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:700(conform_cli)
1 0.000 0.000 19.640 19.640 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:652(extract_to_source_csv)
1 3.952 3.952 19.640 19.640 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:343(ogr_source_to_csv)
1 0.163 0.163 9.903 9.903 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:678(transform_to_out_csv)
44111 0.367 0.000 6.508 0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:187(next)
1764400 2.433 0.000 6.275 0.000 /usr/lib/python2.7/dist-packages/osgeo/ogr.py:3012(GetField)
44111 0.324 0.000 5.837 0.000 /usr/lib/python2.7/csv.py:104(next)
44112 5.138 0.000 5.370 0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:105(next)
88222 0.095 0.000 3.969 0.000 /usr/lib/python2.7/csv.py:151(writerow)
88222 0.120 0.000 2.759 0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:82(writerow)
44110 0.078 0.000 2.512 0.000 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:558(row_transform_and_convert)
88224 0.450 0.000 1.791 0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:46(_stringify_list)
972727 0.595 0.000 1.578 0.000 {method 'decode' of 'str' objects}
44110 0.078 0.000 1.487 0.000 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:590(row_smash_case)
1764440 0.502 0.000 1.328 0.000 /usr/lib/python2.7/dist-packages/osgeo/ogr.py:3183(GetFieldDefn)
2029152 0.652 0.000 1.325 0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:35(_stringify)
88222 0.933 0.000 1.114 0.000 /usr/lib/python2.7/csv.py:143(_dict_to_list)
7149945 1.114 0.000 1.114 0.000 {isinstance}
1764400 0.537 0.000 1.109 0.000 /usr/lib/python2.7/dist-packages/osgeo/ogr.py:3478(GetNameRef)
1764400 0.440 0.000 0.988 0.000 /usr/lib/python2.7/dist-packages/osgeo/ogr.py:2552(IsFieldSet)
972728 0.309 0.000 0.983 0.000 /usr/lib/python2.7/encodings/utf_8.py:15(decode)
1764400 0.504 0.000 0.960 0.000 /usr/lib/python2.7/dist-packages/osgeo/ogr.py:2277(GetFieldCount)
44110 0.715 0.000 0.920 0.000 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:592(<dictcomp>)
88222 0.848 0.000 0.848 0.000 {method 'writerow' of '_csv.writer' objects}
1764440 0.826 0.000 0.826 0.000 {_ogr.FeatureDefn_GetFieldDefn}
972727 0.258 0.000 0.782 0.000 /usr/lib/python2.7/dist-packages/osgeo/ogr.py:2335(GetFieldAsString)
972728 0.674 0.000 0.674 0.000 {_codecs.utf_8_decode}
44111 0.026 0.000 0.636 0.000 /usr/lib/python2.7/dist-packages/osgeo/ogr.py:1190(GetNextFeature)
Profile conclusions
For a CSV source, we spend roughly half the time converting the source CSV to the extracted intermediate and half converting the intermediate to the final output. That’s no surprise; the two CSV files are nearly identical. A whole lot of that time is spent deep in the bowels of Python’s CSV module reading rows. Again no surprise, but it confirms my suspicion that DictReader may be doing more work than we’d like.
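The DictReader suspicion is easy to test in isolation. A small self-contained benchmark on synthetic data (not a real source file) comparing csv.DictReader against the plain csv.reader, which skips the per-row dict construction:

```python
import csv
import io
import timeit

# synthetic CSV with a header row, roughly the shape of an address source
header = ["LON", "LAT", "NUMBER", "STREET"]
rows = "\n".join("4.89,52.37,%d,Damstraat" % i for i in range(5000))
data = ",".join(header) + "\n" + rows + "\n"

def read_dicts():
    # DictReader builds a dict per row
    return sum(1 for row in csv.DictReader(io.StringIO(data)))

def read_lists():
    # plain reader yields lists; field access becomes an integer index
    reader = csv.reader(io.StringIO(data))
    next(reader)  # skip header
    return sum(1 for row in reader)

t_dict = timeit.timeit(read_dicts, number=20)
t_list = timeit.timeit(read_lists, number=20)
print("DictReader: %.3fs  plain reader: %.3fs" % (t_dict, t_list))
```

If the gap is big enough, switching the hot loops to the list-based reader with a precomputed column-name-to-index map would keep the code readable without paying for a dict per row.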
For a shapefile source, we spend roughly 2/3 of the time using OGR to convert to CSV and 1/3 of the time converting the intermediate CSV to the final output. The OGR code is opaque; it’s not clear how to figure out what it’s really spending time on inside the C module.
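One thing the profile does show on the Python side: GetField / GetFieldDefn / GetNameRef each appear ~1.76M times, which is once per field per feature and suggests the field-name lookup happens inside the per-feature loop. A sketch of hoisting that lookup; the Feature class here is a stub standing in for osgeo.ogr.Feature (real OGR spells the name lookup GetFieldDefnRef(i).GetNameRef()), so this only illustrates the pattern:

```python
class Feature(object):
    """Stub with ogr.Feature-like methods, for illustration only."""
    def __init__(self, values, names):
        self._values, self._names = values, names
    def GetFieldCount(self):
        return len(self._values)
    def GetFieldName(self, i):  # stub for GetFieldDefnRef(i).GetNameRef()
        return self._names[i]
    def GetFieldAsString(self, i):
        return self._values[i]

NAMES = ["NUMBER", "STREET"]
features = [Feature(["1", "Main St"], NAMES),
            Feature(["2", "Oak Ave"], NAMES)]

def rows_per_feature_lookup(feats):
    # pattern the profile suggests: rebuild the schema for every feature
    return [{f.GetFieldName(i): f.GetFieldAsString(i)
             for i in range(f.GetFieldCount())} for f in feats]

def rows_hoisted(feats, names):
    # schema computed once by the caller; per feature we only fetch values
    return [{name: f.GetFieldAsString(i) for i, name in enumerate(names)}
            for f in feats]

print(rows_hoisted(features, NAMES))
```

In real OGR the schema comes from the layer definition, so it can be read once before the feature loop; that would cut millions of the cheap-but-numerous calls visible in the chelan profile.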
Not clear what conclusions to draw here; there never are clear ones with profiling tools. I think my next step should be benchmarking Python’s CSV DictReader to see whether some simpler parsing would be significantly faster. I also think it’d be a huge improvement to remove the intermediate CSV file entirely; there’s a lot of overhead in reading and writing it. The intermediate file makes the code much simpler to work with, but it should be possible to stream the data in memory and retain most of the same code model.
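A sketch of what that in-memory streaming could look like: the extract stage yields row dicts and the transform stage consumes the generator directly, so nothing is serialized between the two stages. Function names and the case-smashing step are illustrative, not the actual conform.py API:

```python
import csv
import io

def extract_rows(source_file):
    # stage 1: parse the source; today this stage writes an intermediate CSV
    for row in csv.DictReader(source_file):
        yield row

def transform_rows(rows):
    # stage 2: today this re-reads the intermediate CSV; here it consumes
    # the generator row by row, in constant memory (case smashing as an
    # example per-row transform)
    for row in rows:
        yield {k.lower(): v for k, v in row.items()}

source = io.StringIO("NUMBER,STREET\n1,Main St\n2,Oak Ave\n")
out = list(transform_rows(extract_rows(source)))
print(out)
```

The two stages keep their current boundaries, so the existing per-row functions (row_extract_and_reproject, row_transform_and_convert, etc.) could slot in largely unchanged; only the CSV round trip between them goes away.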
It’s not clear any of this optimization is worth the effort.