For my wind visualization I have a bunch of 2-3 megabyte text files, each containing a lot of METARs. I parse them to get the wind data, then run some statistics on it to produce a rollup.

Doing all this work in one swoop takes 5.3s for a smallish file (1.5 megs, 24000 lines).

Parsing the file and writing it to CSV takes 4.6s. Reading the CSV back in and doing the rollups takes 1.4s. That's 6.0s total, so splitting the work into two steps costs about 0.7s of overhead thanks to the intermediate file. OTOH I should really only need to generate that CSV once.
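Here's roughly what the two-step split looks like, as a minimal sketch rather than my actual code. The function names, the CSV columns, and the per-direction mean speed rollup are all made up for illustration, and the regex only handles the plain METAR wind group (direction or VRB, speed, optional gust, KT).

```python
import csv
import os
import re
from collections import defaultdict

# Assumed wind group format, e.g. 27015G25KT or VRB03KT.
# Real METARs have more variants (MPS units, missing data) not handled here.
WIND_RE = re.compile(r"\b(\d{3}|VRB)(\d{2,3})(?:G(\d{2,3}))?KT\b")


def parse_winds(lines):
    """Yield (direction, speed_kt) for each line containing a wind group."""
    for line in lines:
        m = WIND_RE.search(line)
        if m:
            yield m.group(1), int(m.group(2))


def write_csv(metar_path, csv_path):
    """Step 1 (slow): parse the raw METAR text and write the winds to CSV."""
    with open(metar_path) as src, open(csv_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["direction", "speed_kt"])
        writer.writerows(parse_winds(src))


def rollup(csv_path):
    """Step 2 (fast): return {direction: (count, mean speed in knots)}."""
    totals = defaultdict(lambda: [0, 0])
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            t = totals[row["direction"]]
            t[0] += 1
            t[1] += int(row["speed_kt"])
    return {d: (n, s / n) for d, (n, s) in totals.items()}


def rollup_cached(metar_path, csv_path):
    """Only pay the parsing cost if the CSV doesn't exist yet."""
    if not os.path.exists(csv_path):
        write_csv(metar_path, csv_path)
    return rollup(csv_path)
```

The point of `rollup_cached` is that the expensive parse happens once per file; after that, every rerun of the statistics only pays the CSV read.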

Once again I learned the hard way that iterating line by line through gzip.open() is really slow; it's way faster to read the whole file into RAM and wrap it in a cStringIO if necessary. Terrible buffering in Python’s library.
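A minimal sketch of the read-it-all-first approach. I've written it for Python 3, where cStringIO no longer exists and io.StringIO stands in for it; the function name and the ASCII decoding are my assumptions. Whether the speed gap is still as bad on your Python version is worth timing yourself.

```python
import gzip
import io


def read_gzip_lines(path):
    """Decompress the whole file in one read, then serve lines from memory.

    Returns a file-like object, so it can be iterated line by line just
    like the object gzip.open() returns.
    """
    with gzip.open(path, "rb") as f:
        data = f.read()  # one big read instead of many small line reads
    # Assumption: METAR text is ASCII; undecodable bytes are replaced.
    return io.StringIO(data.decode("ascii", errors="replace"))
```

Usage is a drop-in swap: `for line in read_gzip_lines("metars.txt.gz"): ...` (the filename here is just a placeholder). It costs the full decompressed size in RAM, which is no big deal for files of a few megabytes.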

I’m beginning to regret not just loading all this crap into a relational database.