Nelson's log

S3 vs gzip encoding

S3 is an easy way to serve data on the web. It’s also pretty limited. One problem is that the S3 web servers don’t do gzip compression on the fly: they don’t parse Accept-Encoding from clients and they don’t opportunistically gzip encode responses. Some details in this blog post.
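
For reference, here’s roughly what on-the-fly negotiation looks like in a server that does it. This is a minimal sketch of the general technique, not anything S3 exposes. The function names accepts_gzip and respond are made up for illustration, and headers are modeled as a plain dict.

```python
import gzip


def accepts_gzip(accept_encoding: str) -> bool:
    """Return True if an Accept-Encoding header value allows gzip."""
    wildcard_q = None
    for item in accept_encoding.split(","):
        coding, _, params = item.strip().partition(";")
        coding = coding.strip().lower()
        q = 1.0  # a coding without an explicit q-value counts as q=1
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"
        if coding == "gzip":
            return q > 0
        if coding == "*":
            wildcard_q = q
    # No explicit gzip entry: fall back to the wildcard, if there was one.
    return wildcard_q is not None and wildcard_q > 0


def respond(body: bytes, request_headers: dict) -> tuple:
    """Gzip the body only when the client asked for it (hypothetical helper)."""
    headers = {"Vary": "Accept-Encoding"}
    if accepts_gzip(request_headers.get("Accept-Encoding", "")):
        headers["Content-Encoding"] = "gzip"
        body = gzip.compress(body)
    return headers, body
```

A client that sends no Accept-Encoding header gets the bytes back untouched, which is the behavior you can’t get from a single S3 URL.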

So if you want to serve content gzipped, your only option is to serve it under two URLs: one with gzip, one without. You’re on your own for figuring out how to tell clients which URL they should use. The Amazon docs talk about how to set this up. In particular you want to hack in a Content-Encoding: gzip header on the compressed version so the client knows to decompress it.
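
A sketch of the two-URL setup, assuming you upload with boto3’s put_object. The "gz/" key prefix, the text/csv content type, and the helper names are my own choices for illustration, not something the Amazon docs prescribe.

```python
import gzip


def build_uploads(key: str, data: bytes, content_type: str = "text/csv") -> list:
    """Return put_object keyword arguments for a plain and a gzipped copy."""
    plain = {"Key": key, "Body": data, "ContentType": content_type}
    compressed = {
        "Key": "gz/" + key,  # hypothetical naming scheme for the second URL
        "Body": gzip.compress(data),
        "ContentType": content_type,
        "ContentEncoding": "gzip",  # tells the client to decompress
    }
    return [plain, compressed]


def upload_both(bucket: str, key: str, data: bytes) -> None:
    """Upload both copies; needs boto3 installed and AWS credentials configured."""
    import boto3  # third-party, imported here so the sketch loads without it

    s3 = boto3.client("s3")
    for kwargs in build_uploads(key, data):
        s3.put_object(Bucket=bucket, **kwargs)
```

The Content-Type stays the same on both objects. Only the compressed copy carries Content-Encoding.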

There is a hack some folks recommend, which is to only serve a gzip encoded version of your file at the URL. This will work with any modern web browser, but it’s not correct. You’re not supposed to serve gzip to clients that don’t ask for it. This actually matters in practice: by default neither curl nor wget asks for gzip or decodes it (curl only does so if you pass --compressed). So if you do this hack, those clients will see compressed gibberish.
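
To make the failure concrete, here’s a small simulation of what two kinds of client do with the same response. It’s only a sketch: headers are a plain dict, and decode_body is a made-up stand-in for the decoding a browser does internally.

```python
import gzip


def decode_body(raw: bytes, headers: dict) -> bytes:
    """Undo Content-Encoding: gzip, the way a browser would."""
    encoding = headers.get("Content-Encoding", "identity").strip().lower()
    if encoding in ("gzip", "x-gzip"):
        return gzip.decompress(raw)
    return raw


# What the gzip-only hack puts on the wire, whatever the client asked for.
wire_headers = {"Content-Encoding": "gzip"}
wire_body = gzip.compress(b"id,name\n1,example\n")

# A client that ignores Content-Encoding just keeps the raw bytes.
naive_view = wire_body

# A browser-like client honors the header and recovers the CSV.
browser_view = decode_body(wire_body, wire_headers)
```

The naive client ends up with a file that starts with the gzip magic bytes 0x1f 0x8b instead of text.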

Note also there’s no way to upload to S3 with compression and have it turn into an uncompressed file when it gets there. If you want an uncompressed object, you have to upload the uncompressed bytes. Even uploading the gzip file is a bit tricky because you need to know the full file size to do a simple upload, and you don’t know the compressed size until you’ve finished compressing. This StackExchange question discusses how you can do a multi-part upload to sort of stream while you compress.
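
Here’s one way the compress-while-streaming half could look. It is my own sketch, not necessarily what the StackExchange answers do. It buffers gzip output and hands it back in fixed-size parts, since S3 multipart uploads need every part except the last to be at least 5 MiB. The name gzip_parts is hypothetical.

```python
import zlib
from typing import Iterable, Iterator

MIN_PART_SIZE = 5 * 1024 * 1024  # S3 minimum for every part except the last


def gzip_parts(chunks: Iterable[bytes], part_size: int = MIN_PART_SIZE) -> Iterator[bytes]:
    """Compress a stream and yield the gzip output in part_size pieces.

    Every part is exactly part_size bytes except the last, which may be
    smaller. Concatenated in order, the parts form one valid gzip file.
    """
    compressor = zlib.compressobj(wbits=31)  # wbits=31 selects gzip framing
    buffer = bytearray()
    for chunk in chunks:
        buffer += compressor.compress(chunk)
        while len(buffer) >= part_size:
            yield bytes(buffer[:part_size])
            del buffer[:part_size]
    # Flush whatever the compressor is still holding, plus the gzip trailer.
    buffer += compressor.flush()
    while len(buffer) >= part_size:
        yield bytes(buffer[:part_size])
        del buffer[:part_size]
    if buffer:
        yield bytes(buffer)
```

With boto3 I’d expect each yielded part to go to upload_part with PartNumber counting from 1, between a create_multipart_upload and a complete_multipart_upload call. Memory use stays around one part plus one input chunk, however big the file is.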

OpenAddresses wants to serve giant CSV files via gzip. I think the right way to do that is to serve the files explicitly as “foo.csv.gz” and serve a gzip binary blob. No Content-Encoding. That requires the client downloader to know to decompress it.
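
On the client side that isn’t much work. A minimal sketch of a downloader that decompresses as it reads, assuming the file is UTF-8 CSV; read_csv_gz is a made-up name, and the in-memory sample stands in for a real download.

```python
import csv
import gzip
import io
from typing import BinaryIO, Iterator, List


def read_csv_gz(stream: BinaryIO) -> Iterator[List[str]]:
    """Yield CSV rows from a gzip-compressed binary stream, one at a time."""
    # gzip.open accepts an already-open binary file object as well as a path.
    with gzip.open(stream, mode="rt", encoding="utf-8", newline="") as text:
        yield from csv.reader(text)


# Stand-in for the downloaded foo.csv.gz blob.
sample = io.BytesIO(gzip.compress(b"id,name\n1,example\n"))
rows = list(read_csv_gz(sample))
```

For a real download you could pass in the response object from urllib.request.urlopen, which is readable like a binary file, so the giant CSV never has to sit in memory or on disk uncompressed.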

For a business that’s all about charging people for storing and serving bytes, it’s kind of surprising they don’t do the simple obvious thing that could give you something like a 75% reduction in the number of bytes you store and serve. I’m not suggesting any nefarious plot here, but I do wonder what it is about Amazon’s serving infrastructure that makes them unable to do gzip encoding.

Update: Amazon CloudFront started supporting gzip in Dec 2015. Not S3 itself though.