Python CSV benchmarks

I tested various ways of reading a CSV file in Python, from simply reading the file line by line to using the full unicodecsv DictReader. Here’s what I discovered. Test data is dk.csv, a 3.4M-row CSV file with 46 columns. (See also: file reading benchmarks.)

  • The Python2 csv module takes about 1.7x longer than a naive split(',') on every line (10.97s vs. 6.62s)
  • Python2 DictReader takes 2-3x longer than the simple csv reader that returns tuples
  • Python2 unicodecsv takes 5.5x longer than csv
  • Python3 csv takes 2-3x longer than Python2 csv. However, it is Unicode-correct.
  • Pandas in Python2 is about the same speed as DictReader, but is Unicode-correct.
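
One caveat on the first bullet: the naive split is faster partly because it does less work. Here is a minimal sketch (Python 3) of where the two approaches disagree. The sample row is made up for illustration and is not taken from dk.csv.

```python
import csv
import io

# Hypothetical sample: one header row and one data row whose
# second field is quoted and contains a comma.
sample = 'id,street,city\n1,"Main St, Apt 4",Aarhus\n'

def naive_rows(text):
    """Split every line on commas, ignoring CSV quoting rules."""
    return [line.split(',') for line in text.splitlines()]

def csv_rows(text):
    """Parse the same text with the csv module, which honors quoting."""
    return list(csv.reader(io.StringIO(text)))

naive = naive_rows(sample)
parsed = csv_rows(sample)

# The naive split breaks the quoted field in two; csv.reader keeps it whole.
print(len(naive[1]), naive[1])
print(len(parsed[1]), parsed[1])
```

So the split() timing is best read as a lower bound on parsing cost, not as a drop-in replacement for the csv module.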

I’m not sure why unicodecsv is so slow. I took a quick look with cProfile and all the time is being spent in next(), where you’d expect. All those isinstance tests add significant time (20% or so), but that’s not the majority of the 5.5x slowdown. I guess string decoding is just a lot of overhead in Python2? It’s not trivial in Python3 either; I hadn’t realized how much slower string IO was in Py3. I wonder if there’s more going on. Anyway, I filed an issue on unicodecsv asking about performance.
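
For anyone who wants to repeat that kind of look, here is a rough sketch of profiling a CSV reader with cProfile. It is not the exact session I ran: it uses the stdlib csv.DictReader on Python 3 as a stand-in for unicodecsv, and a small generated in-memory file (the make_sample helper is hypothetical) instead of dk.csv.

```python
import cProfile
import csv
import io
import pstats

def make_sample(rows=2000, cols=46):
    """Build a small in-memory CSV (hypothetical stand-in for dk.csv)."""
    line = ','.join('field%d' % i for i in range(cols))
    return '\n'.join([line] * rows) + '\n'

def parse_all(text):
    """Iterate a DictReader to completion and return the row count."""
    count = 0
    for _ in csv.DictReader(io.StringIO(text)):
        count += 1
    return count

profiler = cProfile.Profile()
profiler.enable()
row_count = parse_all(make_sample())
profiler.disable()

# Collect the ten most expensive calls, sorted by cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats('cumulative').print_stats(10)
report = out.getvalue()
print(report)
```

Sorting by cumulative time puts the reader's iteration call near the top, which is where I'd start looking.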

I’ve never used Pandas before. I ran into someone else who found unicodecsv slow and switched to Pandas. It sure is fast! I think it’s a lot of optimized C code. But Pandas is a big package with its own model of data, and I don’t know that I want to buy into all of that. Its CSV module is nicely feature-rich though.

Not sure what conclusion to draw for OpenAddresses. I think we spend ~50% of our time just parsing CSV (for a CSV source like dk or nl). Switching from DictReader to regular reader() is the least pain. Concretely, for a 60 minute job that’d bring the time down to about 40–45 minutes. A nice improvement, but not life altering. Switching to Python3 so we no longer need unicodecsv would also save about the same amount of time.
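
The arithmetic behind those estimates, as a small sketch. It assumes the job currently uses unicodecsv's DictReader, that exactly 50% of a 60 minute job is CSV parsing, and that parse time scales the way the benchmark timings do. The estimate_job_minutes helper is just for illustration.

```python
# Timings copied from the Python 2 and Python 3 result tables in this post.
UNICODE_DICT_READER = 120.93   # Python 2 unicodeCsvDictReader
UNICODE_READER = 62.82         # Python 2 unicodeCsvReader
PY3_DICT_READER = 51.37        # Python 3 csvDictReader

def estimate_job_minutes(total_minutes, csv_fraction, old_secs, new_secs):
    """Scale only the CSV-parsing share of a job by new_secs / old_secs."""
    csv_minutes = total_minutes * csv_fraction
    other_minutes = total_minutes - csv_minutes
    return other_minutes + csv_minutes * (new_secs / old_secs)

# Assumptions: a 60 minute job, half of it spent parsing CSV.
plain_reader = estimate_job_minutes(60, 0.5, UNICODE_DICT_READER, UNICODE_READER)
python3 = estimate_job_minutes(60, 0.5, UNICODE_DICT_READER, PY3_DICT_READER)

print('%.1f min with reader(), %.1f min on Python 3' % (plain_reader, python3))
```

Both options land in roughly the same place, which is why neither one looks like an obvious winner on speed alone.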

Python 2 results

  0.17s catDevNull
  0.25s wc
  1.91s pythonRead
  0.58s pythonReadLine
  6.62s dumbCsv
 10.97s csvReader
 29.63s csvDictReader
 62.82s unicodeCsvReader
120.93s unicodeCsvDictReader
 27.93s pandasCsv

Python 3 results

 0.17s catDevNull
 0.25s wc
 5.18s pythonRead
 3.50s pythonReadLine
11.83s dumbCsv
27.77s csvReader
51.37s csvDictReader

The Code

# Benchmark various ways of reading a csv file
# Code works in Python2 and Python3

import sys, time, os, csv

# unicodecsv and pandas are optional; skip their benchmarks if missing
try:
    import unicodecsv
except ImportError:
    unicodecsv = None
try:
    import pandas
except ImportError:
    pandas = None

fn = sys.argv[1]

def warmCache():
    os.system('cat %s > /dev/null' % fn)

def catDevNull():
    os.system('cat %s > /dev/null' % fn)

def wc():
    os.system('wc -l %s > /dev/null' % fn)

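def pythonRead():
    # Read the whole file into memory in one call
    with open(fn) as fp:
        fp.read()

def pythonReadLine():
    # Iterate line by line without parsing anything
    with open(fn) as fp:
        for line in fp:
            pass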

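def csvReader():
    with open(fn, 'r') as fp:
        for row in csv.reader(fp):
            pass

def unicodeCsvReader():
    # unicodecsv decodes bytes itself, so open the file in binary mode
    # ('rb' behaves the same as 'r' in Python2 on Unix)
    with open(fn, 'rb') as fp:
        for row in unicodecsv.reader(fp):
            pass

def csvDictReader():
    with open(fn, 'r') as fp:
        for row in csv.DictReader(fp):
            pass

def unicodeCsvDictReader():
    # Binary mode for the same reason as unicodeCsvReader
    with open(fn, 'rb') as fp:
        for row in unicodecsv.DictReader(fp):
            pass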

def dumbCsv():
    'Really simplistic CSV style parsing'
    with open(fn, 'r') as fp:
        for line in fp:
            d = line.split(',')

def pandasCsv():
    d = pandas.read_csv(fn, encoding='utf-8')
    # Ensure pandas really read the whole thing
    d.tail(1)

def run(f):
    "Run a function, return time to execute it."
    # timeit is complex overkill
    s = time.time()
    f()
    e = time.time()
    return e-s

warmCache()

functions = [catDevNull, wc, pythonRead, pythonReadLine, dumbCsv, csvReader, csvDictReader]
if unicodecsv:
    functions.append(unicodeCsvReader)
    functions.append(unicodeCsvDictReader)
if pandas:
    functions.append(pandasCsv)

for f in functions:
    t = run(f)
    print('%.2fs %s' % (t, f.__name__))
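
The run() helper above times each function once, so a single noisy run can skew a number. A possible variation, sketched here for Python 3 only (time.perf_counter does not exist in Python 2), repeats each function and keeps the fastest time. The run_best_of name and the sample_workload function are hypothetical and were not part of the benchmark that produced the results above.

```python
import time

def run_best_of(f, repeats=3):
    """Call f several times and return the fastest wall-clock time."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        f()
        timings.append(time.perf_counter() - start)
    return min(timings)

def sample_workload():
    """Hypothetical stand-in for one of the benchmark functions."""
    return sum(len(s.split(',')) for s in ['a,b,c'] * 10000)

best = run_best_of(sample_workload)
print('%.4fs %s' % (best, sample_workload.__name__))
```

Taking the minimum tends to filter out interference from other processes, at the cost of a longer total run on a file as large as dk.csv.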