Python CSV benchmarks

I tested various ways of reading a CSV file in Python, from simply reading the file line by line up to the full unicodecsv DictReader. Here’s what I discovered. Test data is dk.csv, a 3.4M-row CSV file with 46 columns. (See also: file reading benchmarks.)

  • The Python2 csv module takes 2x longer than a naive split(',') on every line.
  • Python2 DictReader takes 2-3x longer than the simple csv reader that returns tuples.
  • Python2 unicodecsv takes 5.5x longer than csv.
  • Python3 csv takes 2-3x longer than Python2 csv; however, it is Unicode-correct.
  • Pandas in Python2 is about the same speed as DictReader, but is Unicode-correct.
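
The reason the csv module is worth its 2x cost over a naive split is quoting: a comma inside a quoted field breaks the naive approach. A minimal illustration (made-up data):

```python
import csv
import io

line = 'name,"Copenhagen, DK",42\n'

# Naive split breaks on the comma inside the quoted field
naive = line.strip().split(',')
# csv.reader handles the quoting correctly
proper = next(csv.reader(io.StringIO(line)))

print(naive)   # ['name', '"Copenhagen', ' DK"', '42']
print(proper)  # ['name', 'Copenhagen, DK', '42']
```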

I’m not sure why unicodecsv is so slow. I took a quick look with cProfile, and all the time is being spent in next(), where you’d expect. All those isinstance tests add significant time (20% or so), but that’s not the majority of the 5.5x slowdown. I guess string decoding just carries a lot of overhead in Python2? It’s not trivial in Python3 either; I hadn’t realized how much slower string IO is in Py3. I wonder if there’s more going on. Anyway, I filed an issue on unicodecsv asking about performance.

I’ve never used Pandas before. I ran into someone else who found unicodecsv slow and switched to Pandas. It sure is fast! I think it’s a lot of optimized C code. But Pandas is a big package with its own model of data, and I don’t know that I want to buy into all of that. Its CSV module is nicely feature-rich, though.
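
To give a flavor of that different data model: read_csv returns a whole DataFrame rather than yielding rows, and you have to go back out of your way to get DictReader-style records. A sketch with made-up column names:

```python
import io

import pandas as pd

data = io.StringIO('city,population\nCopenhagen,600000\nAarhus,280000\n')
df = pd.read_csv(data)

# itertuples() is the idiomatic way to walk rows; each row is a namedtuple
for row in df.itertuples(index=False):
    print(row.city, row.population)

# Or recover DictReader-style dicts (convenient, but an extra conversion)
records = df.to_dict('records')
# records[0] == {'city': 'Copenhagen', 'population': 600000}
```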

Not sure what conclusion to draw for OpenAddresses. I think we spend ~50% of our time just parsing CSV (for a CSV source like dk or nl). Switching from DictReader to the regular reader() is the least painful change. Concretely, for a 60-minute job that would bring the time down to about 40–45 minutes. A nice improvement, but not life-altering. Switching to Python3, so we no longer need unicodecsv, would save about the same amount of time.
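
The DictReader-to-reader() switch looks roughly like this: read the header once, resolve the column positions you care about up front, and index into each row tuple instead of paying for a dict per row. (Column names here are made up.)

```python
import csv
import io

# In-memory stand-in for a real source file
data = io.StringIO('lon,lat,street\n12.57,55.68,"Main St"\n')

reader = csv.reader(data)
header = next(reader)           # grab the header row once
lat_i = header.index('lat')     # resolve column positions up front
street_i = header.index('street')

for row in reader:
    # Plain list indexing, no per-row dict construction
    lat, street = row[lat_i], row[street_i]
    print(lat, street)
```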

Python 2 results

 0.17s catDevNull
 0.25s wc
 1.91s pythonRead
 0.58s pythonReadLine
 6.62s dumbCsv
 10.97s csvReader
 29.63s csvDictReader
 62.82s unicodeCsvReader
120.93s unicodeCsvDictReader
 27.93s pandasCsv

Python 3 results

 0.17s catDevNull
 0.25s wc
 5.18s pythonRead
 3.50s pythonReadLine
11.83s dumbCsv
27.77s csvReader
51.37s csvDictReader

The Code

# Benchmark various ways of reading a csv file
# Code works in Python2 and Python3

import sys, time, os, csv, timeit
try:
    import unicodecsv
except ImportError:
    unicodecsv = None
try:
    import pandas
except ImportError:
    pandas = None

fn = sys.argv[1]

def warmCache():
    os.system('cat %s > /dev/null' % fn)

def catDevNull():
    os.system('cat %s > /dev/null' % fn)

def wc():
    os.system('wc -l %s > /dev/null' % fn)

def pythonRead():
    fp = open(fn)
    fp.read()

def pythonReadLine():
    fp = open(fn)
    for l in fp:
        pass

def csvReader():
    reader = csv.reader(open(fn, 'r'))
    for l in reader:
        pass

def unicodeCsvReader():
    reader = unicodecsv.reader(open(fn, 'r'))
    for l in reader:
        pass

def csvDictReader():
    reader = csv.DictReader(open(fn, 'r'))
    for l in reader:
        pass

def unicodeCsvDictReader():
    reader = unicodecsv.DictReader(open(fn, 'r'))
    for l in reader:
        pass

def dumbCsv():
    'Really simplistic CSV style parsing'
    fp = open(fn, 'r')
    for l in fp:
        d = l.split(',')

def pandasCsv():
    d = pandas.read_csv(fn, encoding='utf-8')
    # Ensure pandas really read the whole thing
    assert len(d) > 0

def run(f):
    "Run a function, return time to execute it."
    # timeit is complex overkill
    warmCache()
    s = time.time()
    f()
    e = time.time()
    return e-s


functions = [catDevNull, wc, pythonRead, pythonReadLine, dumbCsv, csvReader, csvDictReader]
if unicodecsv:
    functions.extend([unicodeCsvReader, unicodeCsvDictReader])
if pandas:
    functions.append(pandasCsv)

for f in functions:
    t = run(f)
    print('%.2fs %s' % (t, f.__name__))