Python file reading benchmarks

After doing my CSV benchmarks yesterday I decided to do something simpler and just compare how long it takes to read files with and without Unicode decoding. My test file input is a 1.8G UTF-8 text file; the actual data is dk.csv, Danish street names. It’s exactly 1,852,070,317 bytes or 1,838,573,679 characters. The file has DOS newlines, so there’s both a CR (0x0d) and a LF (0x0a) at the end of every line. 33,445,793 lines total.

I didn’t use timeit for a proper microbenchmark, but I did warm the cache and make sure the times were repeatable. One bit of weirdness: sometimes a job takes 2x as long to run. I can’t explain that; maybe some L2 cache contention on a shared CPU? I took the faster timings.

Reading the whole file at once

  • Python 2, reading bytes: 0.5s
  • Python 2, reading unicode via codecs: 5.9s
  • Python 3, reading bytes: 0.5s
  • Python 3, reading unicode via codecs: 1.6s
  • Python 3, reading unicode via open(): 3.7s
  • Python 3, reading unicode via open with no newline conversion: 2.2s

Conclusions

  • Python 2 and Python 3 read bytes at the same speed
  • In Python 2, decoding Unicode is ~12x slower than reading bytes
  • In Python 3, decoding Unicode is 3–7x slower than reading bytes
  • In Python 3, universal newline conversion is ~1.7x slower than skipping it, at least if the file has DOS newlines
  • In Python 3, codecs.open() is faster than open().

That last one about codecs.open is really weird. Codecs isn’t doing newline conversion, but even so it’s still 1.6s vs 2.2s. Maybe open() also does some other processing?
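
One quick way to see that the two calls hand back different machinery (Python 3; dk.csv is the test file from above):

import codecs
import io

# codecs.open() returns a codecs stream wrapper...
f1 = codecs.open('dk.csv', encoding='utf-8')
print(type(f1))   # <class 'codecs.StreamReaderWriter'>

# ...while open() in text mode returns io.TextIOWrapper, which also
# handles universal newlines and its own buffering.
f2 = open('dk.csv', encoding='utf-8')
print(type(f2))   # <class '_io.TextIOWrapper'>

f1.close()
f2.close()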

Reading the file a line at a time

  • Python 2, reading bytes: 0.6s
  • Python 2, reading unicode via codecs: 43s
  • Python 3, reading bytes: 1.0s
  • Python 3, reading unicode via codecs: 58s
  • Python 3, reading unicode via open(): 3.5s
  • Python 3, reading unicode via open with no newline conversion: 3.3s

Conclusions

  • Python 3 is ~1.7x slower than Python 2 at reading bytes line by line
  • In Python 2, reading lines with Unicode is hella slow. About 7x slower than reading Unicode all at once. And Unicode lines are 70x slower than byte lines!
  • In Python 3, reading lines with Unicode is quite fast. About as fast as reading the file all at once. But only if you use the built-in open, not codecs.
  • In Python 3, codecs is really slow for reading line by line. Avoid.

Overall conclusion

Python 3 UTF-8 decoding is significantly faster than Python 2. And it’s probably best to stick with the stock open() call in Py3, not codecs. It may be slower in some circumstances but it’s the recommended option going forward, and the difference isn’t enormous.

Some of these benchmark numbers are surprising. My guess is that someone optimized Python 3 UTF-8 decoding very intensely, which explains why it’s better than Python 2. And the really terrible times like 40–60s are because some Python code (i.e. not C) is manipulating the data, or else data structures are being mutated and it’s breaking CPU caches.
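
A quick way to sanity-check the Python-vs-C guess for the codecs case (CPython 3): codecs line reading goes through pure-Python code in codecs.py, while open() hands lines off to io.TextIOWrapper, which is implemented in C.

import codecs
import inspect
import io

# codecs.StreamReader.readline is plain Python code in Lib/codecs.py
print(inspect.getsourcefile(codecs.StreamReader.readline))

# io.TextIOWrapper comes from the C _io module on CPython
print(io.TextIOWrapper.__module__)   # '_io'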

The code

"""Simple benchmark of reading data in Python 2 and Python 3,
comparing overhead of string decoding"""

import sys, time, codecs

fn = sys.argv[1]

def run(f):
    "Run a function, return time to execute it."
    # timeit is complex overkill
    s = time.time()
    f()
    e = time.time()
    return e-s

def readBytesAtOnce():
    fp = open(fn, 'rb')
    fp.read()

def codecsUnicodeAtOnce():
    fp = codecs.open(fn, encoding='utf-8')
    fp.read()

def py3UnicodeAtOnce():
    fp = open(fn, encoding='utf-8')
    fp.read()

def py3UnicodeAtOnceNoNewlineConversion():
    fp = open(fn, encoding='utf-8', newline='')
    fp.read()


def readBytesByLine():
    for l in open(fn, 'rb'):
        pass

def codecsUnicodeByLine():
    for l in codecs.open(fn, encoding='utf-8'):
        pass

def py3UnicodeByLine():
    for l in open(fn, encoding='utf-8'):
        pass

def py3UnicodeNoNewlineConversionByLine():
    for l in open(fn, encoding='utf-8', newline=''):
        pass

atOnce = [readBytesAtOnce, codecsUnicodeAtOnce]
if sys.version_info.major == 3:
    atOnce.append(py3UnicodeAtOnce)
    atOnce.append(py3UnicodeAtOnceNoNewlineConversion)

byLine = [readBytesByLine, codecsUnicodeByLine]
if sys.version_info.major == 3:
    byLine.append(py3UnicodeByLine)
    byLine.append(py3UnicodeNoNewlineConversionByLine)

open(fn, 'rb').read()    # warm the OS cache (read as bytes so the warm-up isn't also decoding)
for f in atOnce + byLine:    # run both the whole-file and line-by-line benchmarks
    t = run(f)
    print('%7.2fs %s' % (t, f.__name__))

Python CSV benchmarks

I tested various ways of reading a CSV file in python, from simply reading the file line by line to using the full unicodecsv DictReader. Here’s what I discovered. Test data is dk.csv, a 3.4M row CSV file with 46 columns. (See also: file reading benchmarks.)

  • The Python2 csv module takes 2x longer than a naive split(',') on every line
  • Python2 DictReader takes 2-3x longer than the simple csv reader that returns lists
  • Python2 unicodecsv takes 5.5x longer than csv
  • Python3 csv takes 2-3x longer than Python2 csv. However it is Unicode-correct
  • Pandas in Python2 is about the same speed as DictReader, but is Unicode-correct.

I’m not sure why unicodecsv is so slow. I did a quick look in cProfile and all the time is being spent in next() where you’d expect. All those isinstance tests add significant time (20% or so) but that’s not the majority of the 5.5x slowdown. I guess string decoding is just a lot of overhead in Python2? It’s not trivial in Python3 either; I hadn’t realized how much slower string IO was in Py3. I wonder if there’s more going on. Anyway I filed an issue on unicodecsv asking about performance.
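
That quick look can be reproduced with something along these lines (a sketch; read_all is just a throwaway helper, not part of the benchmark script below):

# Sketch of a quick cProfile look at unicodecsv's reader
import cProfile
import pstats
import unicodecsv

def read_all(path):
    with open(path, 'rb') as fp:
        for row in unicodecsv.reader(fp):
            pass

profiler = cProfile.Profile()
profiler.enable()
read_all('dk.csv')
profiler.disable()
pstats.Stats(profiler).sort_stats('cumulative').print_stats(15)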

I’ve never used Pandas before. I ran into someone else saying unicodecsv is slow who switched to Pandas. It sure is fast! I think it’s a lot of optimized C code. But Pandas is a big package and has its own model of data and I don’t know that I want to buy into all of that. Its CSV module is nicely feature-rich though.

Not sure what conclusion to draw for OpenAddresses. I think we spend ~50% of our time just parsing CSV (for a CSV source like dk or nl). Switching from DictReader to regular reader() is the least pain. Concretely, for a 60 minute job that’d bring the time down to about 40–45 minutes. A nice improvement, but not life altering. Switching to Python3 so we no longer need unicodecsv would also save about the same amount of time.
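
For reference, the DictReader-to-reader() switch looks roughly like this. It’s only a sketch; NUMBER and STREET are made-up column names, not the real conform schema.

# Sketch: replace DictReader with reader() plus an explicit header lookup.
# 'NUMBER' and 'STREET' are placeholder column names.
import csv

with open('dk.csv', 'r') as fp:
    reader = csv.reader(fp)
    header = next(reader)                 # first row holds the column names
    number_ix = header.index('NUMBER')
    street_ix = header.index('STREET')
    for row in reader:
        number, street = row[number_ix], row[street_ix]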

Python 2 results

  0.17s catDevNull
  0.25s wc
  1.91s pythonRead
  0.58s pythonReadLine
  6.62s dumbCsv
 10.97s csvReader
 29.63s csvDictReader
 62.82s unicodeCsvReader
120.93s unicodeCsvDictReader
 27.93s pandasCsv

Python 3 results

 0.17s catDevNull
 0.25s wc
 5.18s pythonRead
 3.50s pythonReadLine
11.83s dumbCsv
27.77s csvReader
51.37s csvDictReader

The Code

# Benchmark various ways of reading a csv file
# Code works in Python2 and Python3

import sys, time, os, csv
try:
    import unicodecsv
except ImportError:
    unicodecsv = None
try:
    import pandas
except ImportError:
    pandas = None

fn = sys.argv[1]

def warmCache():
    os.system('cat %s > /dev/null' % fn)

def catDevNull():
    os.system('cat %s > /dev/null' % fn)

def wc():
    os.system('wc -l %s > /dev/null' % fn)

def pythonRead():
    fp = open(fn)
    fp.read()

def pythonReadLine():
    fp = open(fn)
    for l in fp:
        pass

def csvReader():
    reader = csv.reader(open(fn, 'r'))
    for l in reader:
        pass

def unicodeCsvReader():
    reader = unicodecsv.reader(open(fn, 'r'))
    for l in reader:
        pass

def csvDictReader():
    reader = csv.DictReader(open(fn, 'r'))
    for l in reader:
        pass

def unicodeCsvDictReader():
    reader = unicodecsv.DictReader(open(fn, 'r'))
    for l in reader:
        pass

def dumbCsv():
    'Really simplistic CSV style parsing'
    fp = open(fn, 'r')
    for l in fp:
        d = l.split(',')

def pandasCsv():
    d = pandas.read_csv(fn, encoding='utf-8')
    # Ensure pandas really read the whole thing
    d.tail(1)

def run(f):
    "Run a function, return time to execute it."
    # timeit is complex overkill
    s = time.time()
    f()
    e = time.time()
    return e-s

warmCache()

functions = [catDevNull, wc, pythonRead, pythonReadLine, dumbCsv, csvReader, csvDictReader]
if unicodecsv:
    functions.append(unicodeCsvReader)
    functions.append(unicodeCsvDictReader)
if pandas:
    functions.append(pandasCsv)

for f in functions:
    t = run(f)
    print('%.2fs %s' % (t, f.__name__))

OpenAddresses optimization: some baseline timings

Some rough notes for optimizing openaddresses conform. These times are from a January 23 run of the Python code. The machine was busy running 8x jobs, so the times may be a bit inflated over the true values, but it’s a start.

Here’s 3 sources that I know to be slow because of our own code. The times reported here are purely time spent doing conform after the source was downloaded.

  • nl (csv source): 35 minutes, 14.8M rows
  • dk (csv source): 43 minutes, 3.4M rows
  • au-victoria (shapefile source): 46 minutes, 3.4M rows
  • ??? ESRI source. No good examples; most of my conform code treats this as effectively CSV anyway, so going to ignore for now.

I just ran nl again and it took 31.5 minutes (26 minutes user, 5 minutes sys). Close enough, I’ll take these January 23 times as still indicative. At least for CSV sources.

Here’s some fast / short sources I can use for testing. These are total times including network.

  • us-ca-palo_alto.json (csv source) 26 seconds
  • ca-bc-north_cowichan.json (csv source) 24 seconds
  • us-wa-chelan.json (shapefile source) 33 seconds

And here’s a report of top slow jobs that didn’t actually time out. Some of this slowness is due to network download time.

3521s us-va-city_of_chesapeake.json
2807s au-victoria.json
2765s us-ca-marin_county.json
2589s dk.json
2116s nl.json
2032s us-sc-aiken.json
1660s us-va-new_kent.json
1639s es-25830.json
1541s us-nc-alexander.json
1498s us-va.json
1367s us-va-fairfax.json
1352s us-sd.json
1345s us-ca-los_angeles_county.json
1325s us-mn-ramsey.json
1216s us-al-calhoun.json
1015s us-mi-kent.json
973s us-ms-hinds.json
955s us-wa-skagit.json
937s us-tn-rutherford.json
918s us-ca-solano_county.json
918s us-nc.json
786s us-fl-palm_beach_county.json
783s us-wa-seattle.json
776s us-wa-king.json
769s be-flanders.json
762s us-sc-laurens.json
729s us-wy-natrona.json
691s us-il-mchenry.json
682s us-tx-houston.json
678s us-al-montgomery.json
656s pl.json

grep Finished *.log | sort -nr  -k 9 | cut -c 65- | sed 's%for /var/opt/openaddresses/sources/%%' | head -50

Here’s some quicky cProfile output, sorted by cumulative time.

ca-bc-north_cowichan

python -m openaddr.conform ~/src/oa/profile/sources-fast/ca-bc-north_cowichan.json ~/src/oa/profile/caches/cowichan.csv /tmp/o/foo
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    6.053    6.053 <string>:1(<module>)
        1    0.000    0.000    6.053    6.053 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:723(main)
        1    0.000    0.000    6.049    6.049 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:700(conform_cli)
        1    0.000    0.000    3.628    3.628 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:652(extract_to_source_csv)
        1    0.042    0.042    3.628    3.628 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:433(csv_source_to_csv)
    30260    0.170    0.000    3.241    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:187(next)
    30260    0.163    0.000    2.958    0.000 /usr/lib/python2.7/csv.py:104(next)
    30262    2.639    0.000    2.715    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:105(next)
        1    0.042    0.042    2.421    2.421 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:678(transform_to_out_csv)
    15129    0.066    0.000    1.328    0.000 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:510(row_extract_and_reproject)
559802/15130    0.636    0.000    1.256    0.000 /usr/lib/python2.7/copy.py:145(deepcopy)
15132/15130    0.147    0.000    1.219    0.000 /usr/lib/python2.7/copy.py:253(_deepcopy_dict)
    30260    0.028    0.000    0.822    0.000 /usr/lib/python2.7/csv.py:151(writerow)
    30260    0.033    0.000    0.599    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:82(writerow)
    15129    0.026    0.000    0.571    0.000 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:558(row_transform_and_convert)
    30262    0.085    0.000    0.402    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:46(_stringify_list)
   332882    0.147    0.000    0.312    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:35(_stringify)
   509313    0.236    0.000    0.309    0.000 /usr/lib/python2.7/copy.py:267(_keep_alive)
    15129    0.025    0.000    0.256    0.000 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:590(row_smash_case)
    30260    0.170    0.000    0.196    0.000 /usr/lib/python2.7/csv.py:143(_dict_to_list)

us-wa-chelan

python -m openaddr.conform ~/src/oa/profile/sources-fast/us-wa-chelan.json ~/src/oa/profile/caches/chelan/*shp /tmp/o/foo

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   29.549   29.549 <string>:1(<module>)
        1    0.000    0.000   29.549   29.549 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:723(main)
        1    0.000    0.000   29.545   29.545 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:700(conform_cli)
        1    0.000    0.000   19.640   19.640 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:652(extract_to_source_csv)
        1    3.952    3.952   19.640   19.640 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:343(ogr_source_to_csv)
        1    0.163    0.163    9.903    9.903 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:678(transform_to_out_csv)
    44111    0.367    0.000    6.508    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:187(next)
  1764400    2.433    0.000    6.275    0.000 /usr/lib/python2.7/dist-packages/osgeo/ogr.py:3012(GetField)
    44111    0.324    0.000    5.837    0.000 /usr/lib/python2.7/csv.py:104(next)
    44112    5.138    0.000    5.370    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:105(next)
    88222    0.095    0.000    3.969    0.000 /usr/lib/python2.7/csv.py:151(writerow)
    88222    0.120    0.000    2.759    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:82(writerow)
    44110    0.078    0.000    2.512    0.000 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:558(row_transform_and_convert)
    88224    0.450    0.000    1.791    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:46(_stringify_list)
   972727    0.595    0.000    1.578    0.000 {method 'decode' of 'str' objects}
    44110    0.078    0.000    1.487    0.000 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:590(row_smash_case)
  1764440    0.502    0.000    1.328    0.000 /usr/lib/python2.7/dist-packages/osgeo/ogr.py:3183(GetFieldDefn)
  2029152    0.652    0.000    1.325    0.000 /usr/local/lib/python2.7/dist-packages/unicodecsv/__init__.py:35(_stringify)
    88222    0.933    0.000    1.114    0.000 /usr/lib/python2.7/csv.py:143(_dict_to_list)
  7149945    1.114    0.000    1.114    0.000 {isinstance}
  1764400    0.537    0.000    1.109    0.000 /usr/lib/python2.7/dist-packages/osgeo/ogr.py:3478(GetNameRef)
  1764400    0.440    0.000    0.988    0.000 /usr/lib/python2.7/dist-packages/osgeo/ogr.py:2552(IsFieldSet)
   972728    0.309    0.000    0.983    0.000 /usr/lib/python2.7/encodings/utf_8.py:15(decode)
  1764400    0.504    0.000    0.960    0.000 /usr/lib/python2.7/dist-packages/osgeo/ogr.py:2277(GetFieldCount)
    44110    0.715    0.000    0.920    0.000 /home/nelson/src/oa/openaddresses-machine/openaddr/conform.py:592(<dictcomp>)
    88222    0.848    0.000    0.848    0.000 {method 'writerow' of '_csv.writer' objects}
  1764440    0.826    0.000    0.826    0.000 {_ogr.FeatureDefn_GetFieldDefn}
   972727    0.258    0.000    0.782    0.000 /usr/lib/python2.7/dist-packages/osgeo/ogr.py:2335(GetFieldAsString)
   972728    0.674    0.000    0.674    0.000 {_codecs.utf_8_decode}
    44111    0.026    0.000    0.636    0.000 /usr/lib/python2.7/dist-packages/osgeo/ogr.py:1190(GetNextFeature)

Profile conclusions

For a CSV source, we spend roughly half the time converting source CSV to extracted and half converting extracted to final output. That’s no surprise; the two CSV files are nearly identical. A whole lot of that time is spent deep in the bowels of Python’s CSV module reading rows. Again no surprise, but it confirms my suspicion that DictReader may be doing more work than we’d like.

For a shapefile source, we spend roughly 2/3 of the time using OGR to convert to CSV and 1/3 of the time converting the intermediate CSV to the final output. The OGR code is opaque; it’s not clear how to figure out what it’s really spending time on inside the C module.

Not clear what conclusions to draw here; there never are with profiling tools. I think my next step should be benchmarking Python’s CSV DictReader and seeing whether some simpler parsing would work significantly faster. I also think it’d be a huge improvement to remove that intermediate CSV file entirely; there’s a lot of overhead reading and writing it. It makes the code way simpler to work with, but it should be possible to stream the data in memory and retain most of the same code model.
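
A rough sketch of that streaming idea, with made-up extract_rows/transform_row names standing in for the real extract and transform steps:

# Sketch only: pass rows through a generator instead of writing and
# re-reading an intermediate CSV file.
import csv

def extract_rows(source_path):
    with open(source_path, 'r') as fp:
        for row in csv.DictReader(fp):
            yield row                     # was: writerow() to the intermediate CSV

def conform(source_path, out_path, transform_row, fieldnames):
    with open(out_path, 'w') as out:
        writer = csv.DictWriter(out, fieldnames=fieldnames)
        writer.writeheader()
        for row in extract_rows(source_path):
            writer.writerow(transform_row(row))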

Not clear any of this optimization is worth the effort.

OpenVPN in practice

My Tomato OpenVPN setup has served me well on this trip to India. I’ve mostly been using it to dodge the Taj Hotels’ hostile wifi, which leaves random ports unrouted and has a porn filter that blocks any news site with embedded video as well as innocuous sites like reddit.com/r/mapporn.

Running the OpenVPN server on the HTTPS port was definitely the right choice.

It takes too long for Tunnelblick to negotiate a connection to my Tomato router’s OpenVPN server. Like 20+ seconds, often 30. Why isn’t it 1 second?

Setting this up on my home cable modem is a bit silly when I have a nice fast 100Mbit server in a colo. Need to learn to set up OpenVPN on Ubuntu. Now that I’ve done this with the Tomato server that seems easy.

Someone needs to make a privacy-only wrapper for OpenVPN, something just to set up VPN tunnels that route the whole Internet. The whole easy-rsa, cert management, etc process I’m doing now is way too complicated for the simple result I get.

iPad charging vs Air

I plugged my iPad into my Macbook Air to charge and it says “Not Charging”. WTF? Turns out that iPads (sometimes?) only charge when asleep, particularly when plugged in to computer USB ports. That leads to a sort of quantum paradox where the act of observing the thing charging stops it from charging. Stupid.

MacBook Air ports are apparently capable of providing the 2 amps of current an iPad wants to charge. The way to know for sure is to look for “Extra Operating Current” in the USB part of System Information. Oh except the Air won’t put the current out if it’s asleep.

More detail in this article. USB charging is too complicated.

Pretty memory graph

Kinda like how chaotic and pretty this graph is. It’s memory usage on my dev server for the past week, as I kept doing big OpenAddresses runs stressing the system. No huge insights, just looks neat.

[image: memory-week]

HooToo travel router

This router from HooToo is amazing: $18 for the HooToo Tripmate Nano. Ethernet in, 802.11 b/g/n out, and it acts like a router, bridge, can even serve SMB with a USB disk. And the thing is about 2×2 inches, less than half the size of a pack of cards. $18!

No idea how it works as a bulletproof serious router. I bought it for travel, in particular for hostile hotels that either have no WiFi, bad WiFi, or want to charge you per device. If they have an ethernet cable you’re usually better off than trying their wireless.

$18! How is that possible? Particularly the 802.11n chip. It’s an American company too, or at least they list an address in San Jose, though the English on their site is awfully bad. The hardware design is obviously done on the cheap, but it seems solid enough.

The firmware is clearly some Linux derivative. The Nano just offers a binary, which I unpacked far enough to see that it’s an initrd image. Another product of theirs includes a “GPL Source download”, so they’ve at least heard of open source license requirements.