A place where bzip2 is useful

I almost always regret using uber-compressor bzip2. It generally compresses to, oh, 90% of the size of gzip but takes 10x as long to compress and 2x as long to decompress. Not worth it; a bad space/time tradeoff.

The exception is Apache access log files. Something about the terribly repetitive text of these things is really good for bzip2. Here’s a quick test on a 26 meg access.log (using bzip2 1.0.5 and gzip 1.4 on Debian i686):

gzip: 10.8%, 860ms to compress, 160ms to decompress

bzip2: 5.7%, 12700ms to compress, 1600ms to decompress

The bzip2 is literally half the size of the gzip. That’s a lot. Too bad it took 15x as much time to generate it and 10x as much time to decompress it again. A quick Google suggests most people find bzip2 about 3x-5x slower, not sure why it’s so bad for me. Maybe because my sample is also unusually good at compressing!
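The effect is easy to reproduce with Python’s built-in gzip and bz2 modules. A sketch, with one of the sample lines below repeated to fake a repetitive log file (exact percentages and timings will of course vary with your data and machine):

```python
import bz2
import gzip
import time

# Simulate a repetitive access log: one sample line, repeated many times.
line = ('207.46.13.138 - - [04/Jul/2010:06:32:55 +0000] '
        '"GET /robots.txt HTTP/1.1" 200 107 "-" "msnbot/2.0b"\n')
data = (line * 10000).encode()

for name, mod in (("gzip", gzip), ("bzip2", bz2)):
    t0 = time.perf_counter()
    packed = mod.compress(data, compresslevel=9)
    t1 = time.perf_counter()
    mod.decompress(packed)
    t2 = time.perf_counter()
    print(f"{name}: {100 * len(packed) / len(data):.2f}% of original, "
          f"{(t1 - t0) * 1000:.0f}ms compress, "
          f"{(t2 - t1) * 1000:.0f}ms decompress")
```

A single repeated line is an extreme case, so both tools do far better here than on a real log, but the relative shape of the comparison holds.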

Some other fast compression options: pigz and pbzip2 for parallel compressors, or snappy for Google’s very-fast-but-not-as-tight compressor.
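The parallel compressors get their speedup by compressing independent blocks and concatenating the resulting streams, which is legal in the bz2 format. A minimal sketch of that idea (the `parallel_bz2` function and chunk size are mine, not pbzip2’s actual internals):

```python
import bz2
from concurrent.futures import ThreadPoolExecutor

def parallel_bz2(data: bytes, chunk_size: int = 1 << 20) -> bytes:
    # Compress fixed-size chunks as independent .bz2 streams and
    # concatenate them. bz2.compress releases the GIL during
    # compression, so threads genuinely run in parallel.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor() as pool:
        return b"".join(pool.map(bz2.compress, chunks))

data = b'207.46.13.138 - - "GET /robots.txt HTTP/1.1" 200 107\n' * 100_000
packed = parallel_bz2(data)
# Python's bz2.decompress handles multi-stream input, as does bzip2 itself.
assert bz2.decompress(packed) == data
```

The tradeoff is a slightly worse ratio, since matches can’t span chunk boundaries.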

Here’s the sample data I tested:


207.46.13.138 - - [04/Jul/2010:06:32:55 +0000] "GET /robots.txt HTTP/1.1" 301 329 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
207.46.13.138 - - [04/Jul/2010:06:32:55 +0000] "GET /robots.txt HTTP/1.1" 200 107 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
61.135.216.105 - - [04/Jul/2010:06:32:55 +0000] "GET /weblog/index.atom HTTP/1.1" 304 - "-" "Mozilla/5.0 (compatible;YoudaoFeedFetcher/1.0;http://www.youdao.com/help/reader/faq/topic006/;2 subscribers;)"