After doing my CSV benchmarks yesterday I decided to do something simpler and just compare how long it takes to read files with and without Unicode decoding. My test input is a 1.8GB UTF-8 text file; the actual data is dk.csv, Danish street names. It’s exactly 1,852,070,317 bytes or 1,838,573,679 characters. The file has DOS newlines, so there’s both a 0x0D (CR) and a 0x0A (LF) at the end of every line. 33,445,793 lines total.
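For reference, here is a minimal sketch of how byte, character, and line counts like these could be gathered. `describe_file` is a hypothetical helper, not part of the benchmark script, and it assumes the whole file fits in memory. The demo runs it on a tiny made-up sample rather than on dk.csv.

```python
import os
import tempfile

def describe_file(path, encoding="utf-8"):
    """Return (bytes, characters, lines, crlf_lines) for a text file.

    Hypothetical helper: reads the whole file into memory.
    """
    with open(path, "rb") as fp:
        data = fp.read()
    text = data.decode(encoding)
    n_lines = data.count(b"\n")      # every line ends in LF
    n_crlf = data.count(b"\r\n")     # DOS-style line endings
    return len(data), len(text), n_lines, n_crlf

# Tiny made-up sample with two non-ASCII characters and DOS newlines.
sample = u"Søndergade\r\nÆblevej\r\n".encode("utf-8")
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as fp:
    fp.write(sample)

stats = describe_file(path)
os.remove(path)
print(stats)  # (23, 21, 2, 2)
```

The gap between bytes and characters is the number of extra bytes used by multi-byte UTF-8 sequences: two in the sample, and 13,496,638 in the real file (1,852,070,317 minus 1,838,573,679).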
I didn’t use timeit for a proper microbenchmark, but I did warm the cache and make sure the times were repeatable. Some weirdness; sometimes the job takes 2x as long to run. Cannot explain that, maybe some L2 cache contention in a shared CPU? I took the faster timings.
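Taking the faster of several runs can be sketched like this. `best_of` is a hypothetical helper, not what the script at the end of this post does (that one times each function once), and `time.perf_counter` assumes Python 3.3 or later.

```python
import time

def best_of(func, repeats=3):
    """Call func `repeats` times; return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    # The minimum is the run least disturbed by whatever else the machine was doing.
    return min(timings)

# Stand-in workload; a real run would pass one of the file-reading functions.
elapsed = best_of(lambda: sum(range(100000)))
print('%7.4fs' % elapsed)
```

Reporting the minimum hides the occasional 2x outlier, which is fine for comparing approaches but says nothing about why the slow runs happen.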
Reading the whole file at once
- Python 2, reading bytes: 0.5s
- Python 2, reading unicode via codecs: 5.9s
- Python 3, reading bytes: 0.5s
- Python 3, reading unicode via codecs: 1.6s
- Python 3, reading unicode via open(): 3.7s
- Python 3, reading unicode via open with no newline conversion: 2.2s
- Python 2 and Python 3 read bytes at the same speed
- In Python 2, decoding Unicode is about 12x slower than reading bytes
- In Python 3, decoding Unicode is 3–7x slower than reading bytes
- In Python 3, reading with universal newline conversion is ~1.7x slower than skipping it, at least if the file has DOS newlines
- In Python 3, codecs.open() is faster than open().
That last one about codecs.open is really weird. Codecs isn’t doing newline conversion, but even so it’s still 1.6s vs 2.2s. Maybe open() also does some other processing?
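The newline difference between the three Python 3 variants is easy to see on a small file. This is a sketch on a made-up two-line sample, assuming Python 3; it shows only what each call returns, not why the timings differ.

```python
import codecs
import os
import tempfile

# Write a small file with DOS newlines.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as fp:
    fp.write(b"a\r\nb\r\n")

with open(path, encoding="utf-8") as fp:
    translated = fp.read()      # universal newlines: \r\n becomes \n
with open(path, encoding="utf-8", newline="") as fp:
    untranslated = fp.read()    # newline='' leaves line endings alone
with codecs.open(path, encoding="utf-8") as fp:
    via_codecs = fp.read()      # codecs opens the file in binary mode underneath

os.remove(path)

print(repr(translated))    # 'a\nb\n'
print(repr(untranslated))  # 'a\r\nb\r\n'
print(repr(via_codecs))    # 'a\r\nb\r\n'
```

So `codecs.open()` and `open(..., newline='')` hand back the same text, which makes them the fair pair to compare.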
Reading the file a line at a time
- Python 2, reading bytes: 0.6s
- Python 2, reading unicode via codecs: 43s
- Python 3, reading bytes: 1.0s
- Python 3, reading unicode via codecs: 58s
- Python 3, reading unicode via open(): 3.5s
- Python 3, reading unicode via open with no newline conversion: 3.3s
- Python 3 is ~1.7x slower than Python 2 at reading bytes line by line
- In Python 2, reading lines with Unicode is hella slow. About 7x slower than reading Unicode all at once. And Unicode lines are 70x slower than byte lines!
- In Python 3, reading lines with Unicode is quite fast. About as fast as reading the file all at once. But only if you use the built-in open, not codecs.
- In Python 3, codecs is really slow for reading line by line. Avoid.
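The ratios quoted in this post come straight from the timings listed. A quick sketch of the arithmetic, with the numbers copied from the two result lists (the dictionary labels are my own shorthand):

```python
# Seconds, copied from the results above.
at_once = {
    "py2 bytes": 0.5, "py2 codecs": 5.9,
    "py3 bytes": 0.5, "py3 codecs": 1.6,
    "py3 open": 3.7, "py3 open newline=''": 2.2,
}
by_line = {
    "py2 bytes": 0.6, "py2 codecs": 43.0,
    "py3 bytes": 1.0, "py3 codecs": 58.0,
    "py3 open": 3.5, "py3 open newline=''": 3.3,
}

# How much slower is line-by-line reading than reading all at once?
for label in sorted(at_once):
    ratio = by_line[label] / at_once[label]
    print('%-22s %5.1fx' % (label, ratio))

# Unicode lines vs byte lines in Python 2, and bytes by line Py3 vs Py2.
py2_unicode_vs_bytes = by_line["py2 codecs"] / by_line["py2 bytes"]
py3_vs_py2_bytes = by_line["py3 bytes"] / by_line["py2 bytes"]
print('%.0fx, %.1fx' % (py2_unicode_vs_bytes, py3_vs_py2_bytes))
```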
Python 3 UTF-8 decoding is significantly faster than Python 2. And it’s probably best to stick with the stock open() call in Py3, not codecs. It may be slower in some circumstances but it’s the recommended option going forward and the difference isn’t enormous.
Some of these benchmark numbers are surprising. My guess is that someone optimized Python 3 UTF-8 decoding very intensely, which explains why it’s better than Python 2. And the really terrible times like 40-60s are because some Python code (i.e. not C) is manipulating the data, or else data structures are being mutated and it’s breaking CPU caches.
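One cheap way to poke at the "Python code, not C" guess is to ask the interpreter where each reader is implemented. This is only a sketch, assuming CPython 3; it is consistent with the guess but is not a profile and doesn't prove where the time goes.

```python
import codecs
import inspect
import io

# codecs.open() returns a StreamReaderWriter wrapping a codecs.StreamReader.
# On Python 3, StreamReader.readline is an ordinary function defined in codecs.py.
codecs_readline_is_python = inspect.isfunction(codecs.StreamReader.readline)
print(codecs_readline_is_python)
print(inspect.getsourcefile(codecs.StreamReader))

# The built-in open() returns an io.TextIOWrapper in text mode.
# On CPython that type lives in the C extension module _io, so its
# readline is a built-in method descriptor rather than a Python function.
open_readline_is_python = inspect.isfunction(io.TextIOWrapper.readline)
print(open_readline_is_python)
print(io.TextIOWrapper.__module__)
```

If the first check prints True and the second False, line iteration through codecs runs Python-level code for every line, while the built-in open() stays in C.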
"""Simple benchmark of reading data in Python 2 and Python 3,
comparing overhead of string decoding"""

import sys, time, codecs

fn = sys.argv[1]    # path of the file to read

def run(f):
    "Run a function, return time to execute it."
    # timeit is complex overkill
    s = time.time()
    f()
    e = time.time()
    return e-s

def readBytesAtOnce():
    fp = open(fn, 'rb')
    fp.read()

def codecsUnicodeAtOnce():
    fp = codecs.open(fn, encoding='utf-8')
    fp.read()

def py3UnicodeAtOnce():
    fp = open(fn, encoding='utf-8')
    fp.read()

def py3UnicodeAtOnceNoNewlineConversion():
    fp = open(fn, encoding='utf-8', newline = '')
    fp.read()

def readBytesByLine():
    for l in open(fn, 'rb'):
        pass

def codecsUnicodeByLine():
    for l in codecs.open(fn, encoding='utf-8'):
        pass

def py3UnicodeByLine():
    for l in open(fn, encoding='utf-8'):
        pass

def py3UnicodeNoNewlineConversionByLine():
    for l in open(fn, encoding='utf-8', newline = ''):
        pass

# open() only takes an encoding argument in Python 3
atOnce = [readBytesAtOnce, codecsUnicodeAtOnce]
if sys.version_info.major == 3:
    atOnce.append(py3UnicodeAtOnce)
    atOnce.append(py3UnicodeAtOnceNoNewlineConversion)

byLine = [readBytesByLine, codecsUnicodeByLine]
if sys.version_info.major == 3:
    byLine.append(py3UnicodeByLine)
    byLine.append(py3UnicodeNoNewlineConversionByLine)

open(fn, 'rb').read()    # warm cache; read as bytes so no decoding happens here

# Run both sets of tests
for f in atOnce + byLine:
    t = run(f)
    print('%7.2fs %s' % (t, f.__name__))