Nelson's log

Python file reading benchmarks

After doing my CSV benchmarks yesterday I decided to do something simpler and just compare how long it takes to read files with and without Unicode decoding. My test input is a 1.8G UTF-8 text file; the actual data is dk.csv, Danish street names. It’s exactly 1,852,070,317 bytes or 1,838,573,679 characters. The file has DOS newlines, so there’s both a 0x0d (CR) and a 0x0a (LF) at the end of every line. 33,445,793 lines total.
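If you want to reproduce those counts for a file of your own, a quick sketch like this works (Python 3; the dk.csv path is just a placeholder). Opening with newline='' keeps the CR and LF characters in the character count:

import os

fn = 'dk.csv'    # placeholder path to the test file
print('bytes:', os.path.getsize(fn))
chars = lines = 0
with open(fn, encoding='utf-8', newline='') as fp:
    for line in fp:
        chars += len(line)    # includes the CR and LF characters
        lines += 1
print('chars:', chars)
print('lines:', lines)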

I didn’t use timeit for a proper microbenchmark, but I did warm the cache and make sure the times were repeatable. One bit of weirdness: sometimes the job takes 2x as long to run. I can’t explain that; maybe some L2 cache contention on a shared CPU? I took the faster timings.
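Taking the faster timings just amounts to repeating each run and keeping the minimum; a sketch using the run() helper from the code below (the repeat count of 3 is arbitrary):

def best_time(f, runs=3):
    "Run f a few times, return the fastest wall-clock time."
    return min(run(f) for _ in range(runs))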

Reading the whole file at once

Conclusions

That last one about codecs.open is really weird. Codecs isn’t doing newline conversion, but even so it’s still 1.6s vs 2.2s. Maybe open() also does some other processing?
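You can see the newline-conversion difference on a tiny CRLF file. This sketch (Python 3, with a scratch file just for illustration) shows open() translating \r\n to \n via universal newlines while codecs.open() leaves the bytes alone:

import codecs

with open('tiny.txt', 'wb') as fp:    # hypothetical scratch file
    fp.write(b'a\r\nb\r\n')

print(repr(open('tiny.txt', encoding='utf-8').read()))
# 'a\nb\n' -- open() translates CRLF to LF

print(repr(codecs.open('tiny.txt', encoding='utf-8').read()))
# 'a\r\nb\r\n' -- codecs.open() leaves the CR in place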

Reading the file a line at a time

Conclusions

Overall conclusion

Python 3 UTF-8 decoding is significantly faster than Python 2’s. And it’s probably best to stick with the stock open() call in Py3, not codecs.open(). It may be slower in some circumstances, but it’s the recommended option going forward and the difference isn’t enormous.

Some of these benchmark numbers are surprising. My guess is that someone optimized Python 3 UTF-8 decoding very intensely, which explains why it’s better than Python 2. And the really terrible times like 40-50s are because some Python code (i.e. not C) is manipulating the data, or else data structures are being mutated and that’s thrashing the CPU caches.

The code


"""Simple benchmark of reading data in Python 2 and Python 3,
comparing overhead of string decoding"""

import sys, time, codecs

fn = sys.argv[1]

def run(f):
    "Run a function, return time to execute it."
    # timeit is complex overkill
    s = time.time()
    f()
    e = time.time()
    return e-s

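# Read the whole file at once, with and without decoding.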
def readBytesAtOnce():
    fp = open(fn, 'rb')
    fp.read()

def codecsUnicodeAtOnce():
    fp = codecs.open(fn, encoding='utf-8')
    fp.read()

def py3UnicodeAtOnce():
    fp = open(fn, encoding='utf-8')
    fp.read()

def py3UnicodeAtOnceNoNewlineConversion():
    fp = open(fn, encoding='utf-8', newline='')
    fp.read()


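# Read the file a line at a time.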
def readBytesByLine():
    for l in open(fn, 'rb'):
        pass

def codecsUnicodeByLine():
    for l in codecs.open(fn, encoding='utf-8'):
        pass

def py3UnicodeByLine():
    for l in open(fn, encoding='utf-8'):
        pass

def py3UnicodeNoNewlineConversionByLine():
    for l in open(fn, encoding='utf-8', newline=''):
        pass

atOnce = [readBytesAtOnce, codecsUnicodeAtOnce]
if sys.version_info.major == 3:
    atOnce.append(py3UnicodeAtOnce)
    atOnce.append(py3UnicodeAtOnceNoNewlineConversion)

byLine = [readBytesByLine, codecsUnicodeByLine]
if sys.version_info.major == 3:
    byLine.append(py3UnicodeByLine)
    byLine.append(py3UnicodeNoNewlineConversionByLine)

open(fn, 'rb').read()    # warm the OS file cache
for f in atOnce + byLine:
    t = run(f)
    print('%7.2fs %s' % (t, f.__name__))
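
To try it yourself, run the same script under both interpreters (readbench.py is just a placeholder name):

python2 readbench.py dk.csv
python3 readbench.py dk.csv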