Python zipfiles: don’t read a line at a time

Calling readline() on a file inside a Python zipfile is slow. It’s much faster to read the whole thing into memory once, then scan it.

Compare:

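import re
import zipfile

zf = zipfile.ZipFile("foo.zip")

# Iterate a line at a time
# (members are read as bytes, hence the b prefixes; plain str on Python 2)
fp = zf.open("data.txt")
for l in fp:
    if l.startswith(b"FOO"): print(l)

# Read the whole file into memory, grep
regexp = re.compile(br'^(FOO.*)$', re.MULTILINE)
fp = zf.open("data.txt")  # reopen: the loop above consumed the first handle
data = fp.read()
m = regexp.search(data)  # first match only; finditer() would give them all
if m: print(m.group(1))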

I’m not surprised that the second style is faster, but I am surprised by how much: roughly 25 times faster, both on my test case (0.5s vs 12s) and over 3.6 gigs of real data (8 minutes vs 200 minutes). Obviously reading the whole file first takes more RAM, but in my case the files are only 500k each.
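If you want to check the gap on your own Python, here’s a rough timing sketch. It’s written for Python 3 and builds a small synthetic zip in memory, so it’s not my real data, and the helper names (make_zip, scan_by_line, scan_whole, timed) are made up for illustration. Expect the ratio to differ from my numbers depending on the Python version.

```python
import io
import re
import time
import zipfile

def make_zip(n_lines=20000):
    """Build a small in-memory zip holding one synthetic text member."""
    lines = []
    for i in range(n_lines):
        prefix = b"FOO" if i % 1000 == 0 else b"bar"
        lines.append(prefix + b" line %d\n" % i)
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("data.txt", b"".join(lines))
    buf.seek(0)
    return buf

def scan_by_line(zf):
    """Iterate the member a line at a time, keeping lines that start with FOO."""
    with zf.open("data.txt") as fp:
        return [l.rstrip(b"\n") for l in fp if l.startswith(b"FOO")]

FOO_RE = re.compile(br"^(FOO.*)$", re.MULTILINE)

def scan_whole(zf):
    """Read the whole member into memory, then grep it with one regexp."""
    with zf.open("data.txt") as fp:
        return FOO_RE.findall(fp.read())

def timed(func, zf):
    """Return (result, elapsed seconds) for one call of func(zf)."""
    start = time.perf_counter()
    result = func(zf)
    return result, time.perf_counter() - start

if __name__ == "__main__":
    with zipfile.ZipFile(make_zip()) as zf:
        by_line, t_line = timed(scan_by_line, zf)
        whole, t_whole = timed(scan_whole, zf)
    print("line at a time: %.4fs, %d matches" % (t_line, len(by_line)))
    print("whole file:     %.4fs, %d matches" % (t_whole, len(whole)))
```

Both functions return the same list of matching lines, so the only thing being compared is how the data gets read.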

A quick profile suggests zipfile spends a lot of time calling next(); maybe it’s doing work a character at a time? Issue 7216 suggests there’s a 100 byte buffer at work; yuck! It looks to have been fixed in Python 2.7 or some 3.x release.
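For the curious, a profile like that can be taken with cProfile from the standard library. This is a minimal sketch for Python 3 against a synthetic in-memory zip, not the exact profile I ran; build_zip, scan_by_line and profile_scan are hypothetical names. Which functions show up at the top of the report will depend on your Python version.

```python
import cProfile
import io
import pstats
import zipfile

def build_zip():
    """Build an in-memory zip with one member of 20000 short lines."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("data.txt", b"FOO line\nbar line\n" * 10000)
    buf.seek(0)
    return buf

def scan_by_line(buf):
    """Count the lines starting with FOO, reading a line at a time."""
    with zipfile.ZipFile(buf) as zf, zf.open("data.txt") as fp:
        return sum(1 for l in fp if l.startswith(b"FOO"))

def profile_scan():
    """Profile scan_by_line; return its result and a text report."""
    profiler = cProfile.Profile()
    count = profiler.runcall(scan_by_line, build_zip())
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Show the ten entries with the highest cumulative time
    stats.sort_stats("cumulative").print_stats(10)
    return count, out.getvalue()

if __name__ == "__main__":
    count, report = profile_scan()
    print("matching lines:", count)
    print(report)
```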