CGI and ISO-8859-1

Oh, the shame, look at the mojibake that showed up in the Atom feed for my real blog:

In detail, what happened… I get

From⎵ SMTP envelope

 

As opposed to

In detail, what happened… I get

From⎵ SMTP envelope

 

A quick wget -S ‘http://www.somebits.com/weblog/index.atom’ and I found the Atom feed was setting “Content-Type: application/atom+xml; charset=ISO-8859-1”. Like some cretin.

I have no idea why this charset was getting set. My ancient blog software (Blosxom, with a third party Atom plugin) doesn’t seem to be setting the charset explicitly. My best guess is that either Perl or Apache’s CGI helpfully inserts ISO-8859-1 as a default if nothing was explicitly said. Anyway I kludged the problem away by creating blosxom/entries/content_type.atom with the contents “application/atom+xml; charset=UTF-8”.

The Atom document itself specifies <?xml version=”1.0″ encoding=”utf-8″?>. Annoyingly Google Reader seems to believe the HTTP transport charset over the document itself. I have no idea what the spec says you’re supposed to do in this scenario, but I bet 9 times out of 10 the document is more correct than the server.

UTF-8 only, ever, and always.