Typographical quotes in 2023

Working on my new blog engine. I want to have “typographical quotes”, i.e. “smart quotes”. Nice curly things that look like typesetting, not the straight ' or " from some old typewriter. I am most interested in American English’s use of double quotes (open and close) and the apostrophe character we use in the middle of a word for contractions and possessives.

Annoyed I have to do this in my HTML; browsers could do a better job here, applying English curly quote rules to text input that looks like straight quotes.

Side note: testing all this was hard because something in Windows, PowerToys, or Windows Terminal is converting stuff I copy and paste. I can have a U+2019 in my terminal window from cat or from a web page and then when I paste it it turns into a U+0022. It was doing this with HTML entities I copied too. So I’m doing cut -c x..y | od -t x1 and dealing with cut not supporting UTF-8.
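A minimal sketch of an alternative to the cut | od pipeline, assuming Python is at hand: print the code points directly, which sidesteps cut working on bytes rather than characters. The function name codepoints is made up for illustration.

```python
def codepoints(text: str) -> list[str]:
    """Return each character of text as a U+XXXX label."""
    return [f"U+{ord(ch):04X}" for ch in text]


if __name__ == "__main__":
    # "it\u2019s" is "it's" written with a curly U+2019 apostrophe.
    sample = "it\u2019s"
    print(" ".join(codepoints(sample)))
```

Writing the test string with \u escapes in the source keeps the clipboard out of the loop entirely, so nothing can rewrite the character on the way in.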

Side note 2: blogging about all this is impossible. The WordPress editor is rewriting stuff I type, replacing various crafted markup examples with literal quote glyphs.

Unicode

Unicode double quotes are easy. U+201C and U+201D. Right name, right character category, done. Single quotes used for quoting are also easy: U+2018 and U+2019.

Unicode apostrophe is a mess. The ASCII U+0027 APOSTROPHE is what I type, and has the right name and categorization (“Other Punctuation”). But it looks wrong in most renderings, straight up and down and not curly. It’s kept that way for historical reasons.

U+2019 RIGHT SINGLE QUOTATION MARK is what most people use to get a curly apostrophe-looking thing. It works most places and is labelled in Unicode as “FINAL PUNCTUATION”. Which is half correct; it is punctuation. But also half wrong: apostrophe is not final. Also the Unicode character is actually a quote mark, not an apostrophe at all.

Some folks argue U+02BC MODIFIER LETTER APOSTROPHE is the right thing. But that’s the obscure choice. Also it has category “Modifier Letter” because it’s usually used as a letter, not punctuation.

I think U+2019 is the practical choice, because that’s what most people do already. I assume it’s too late for Unicode to fix this by adding a proper apostrophe. The pedants probably think U+0027 is sufficient.
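The names and categories above can be checked with Python's standard unicodedata module. This is just a quick lookup sketch; the helper name describe is made up here. The two-letter codes it prints are Unicode's general category abbreviations: Po is Other Punctuation, Pi is Initial Punctuation, Pf is Final Punctuation, and Lm is Modifier Letter.

```python
import unicodedata

# The characters discussed in this section, written as escapes so that
# no editor or clipboard can swap them for something else.
CANDIDATES = ["\u0027", "\u2018", "\u2019", "\u02bc", "\u201c", "\u201d"]


def describe(ch: str) -> tuple[str, str, str]:
    """Return (code point label, Unicode name, general category) for ch."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


if __name__ == "__main__":
    for ch in CANDIDATES:
        print(*describe(ch))
```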

HTML

The most straightforward thing is just to embed the Unicode in UTF-8. That should be fine in 2023.

The older thing most people do is use Unicode entities so the output is ASCII with HTML encoding. For some reason those tend to be decimal, so you get &#8217; for apostrophe and &#8220; and &#8221; for double quotes. (Hilarious; I’d typed the entities for 8220, 8221, and 8217 in my WordPress editor. When I came back what I typed was replaced by the actual glyphs.)

The W3C has a page from 2015 with different advice. It suggests named entities, using &ldquo; and &rdquo; for double quotes, and &rsquo; for apostrophe. There are also &apos; and &quot; entities which presumably are to be avoided.
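To see that the three spellings are interchangeable, here is a small sketch using only Python's standard library. The helper name to_ascii_entities is made up for illustration; the decimal output comes from Python's built-in xmlcharrefreplace error handler, not from any Markdown tool.

```python
import html


def to_ascii_entities(text: str) -> str:
    """Encode non-ASCII characters as decimal HTML character references."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    curly = "\u201cit\u2019s\u201d"  # curly-quoted "it's"
    # Raw UTF-8 text to decimal references, pure ASCII output.
    print(to_ascii_entities(curly))
    # Named and decimal references both decode back to the same characters.
    print(html.unescape("&ldquo;it&rsquo;s&rdquo;") == curly)
    print(html.unescape("&#8220;it&#8217;s&#8221;") == curly)
```

A browser treats all three forms the same way, so the choice only matters for how readable the HTML source is and whether the file has to stay ASCII.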

Automatic conversion and Markdown

I’m lazy and want to type ASCII straight ' or " and have software convert them to proper quotes. “Smart Quotes”. It’s important to do this in a way that’s aware of HTML markup; you don’t want to be replacing " in URLs, code fragments, etc.
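Here is a rough sketch of what markup-aware conversion means, using only the standard library. It is not how SmartyPants or any real library does it; the names smarten and curl, the crude tag regex, and the single rule (a quote opens after whitespace or an opening bracket, otherwise it closes) are all simplifications made up for illustration.

```python
import re

TAG = re.compile(r"(<[^>]*>)")             # crude tag matcher, fine for a sketch
SKIP = {"code", "pre", "script", "style"}  # element contents left untouched
OPENERS = "([{\u2018\u201c"                # a quote after these is an opening quote


def curl(text: str, prev: str) -> tuple[str, str]:
    """Curl straight quotes in text; prev is the character seen just before it."""
    out = []
    for ch in text:
        if ch in "\"'":
            opening = prev == "" or prev.isspace() or prev in OPENERS
            if ch == '"':
                ch = "\u201c" if opening else "\u201d"
            else:
                ch = "\u2018" if opening else "\u2019"
        out.append(ch)
        prev = ch
    return "".join(out), prev


def smarten(html_text: str) -> str:
    """Curl quotes in text nodes only, skipping tags and code-like elements."""
    out, skip_depth, prev = [], 0, ""
    for i, part in enumerate(TAG.split(html_text)):
        if i % 2:  # odd indices are the captured tags
            m = re.match(r"<(/?)\s*([A-Za-z][A-Za-z0-9]*)", part)
            if m and m.group(2).lower() in SKIP:
                skip_depth = max(skip_depth + (-1 if m.group(1) else 1), 0)
            out.append(part)
        elif skip_depth:  # text inside <code>, <pre>, etc. stays as typed
            out.append(part)
            prev = part[-1:] or prev
        else:
            curled, prev = curl(part, prev)
            out.append(curled)
    return "".join(out)


if __name__ == "__main__":
    print(smarten('<p>"It\'s a <a href="x.html">link</a>," she said.</p>'))
```

The quotes inside the href attribute and inside code elements come through untouched, while the ones in running text get curled. The sketch gets leading apostrophes wrong (a word that starts with an apostrophe gets an opening single quote), which is one of the cases real tools spend their effort on.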

The CommonMark spec doesn’t seem to say anything about typographer’s quotes. All their test cases replace ASCII " with HTML entity &quot; but I’m not clear why, or if the spec demands it. Some sort of escaping safety thing I guess? It’s awkward IMHO. This discussion says the reference implementation does some smart quotes but it’s not part of the spec.

In the ancient days we used “SmartyPants” to do a typographic conversion for us. There’s a modernish Python implementation that seems to work. It emits the decimal entities 8217, 8220, and 8221. Good enough for me! But it’s awkward to plug into other markdown frameworks; in particular, smartypants re-parses the input and it’d be nice to skip that.

Despite that, mistletoe + smartypants works fine.

mistune and marko do not work with smartypants. Both seem to convert " into &quot; like CommonMark’s demo tool does, and smartypants won’t pick those up and replace them with 8220/8221.

The old pre-CommonMark Python markdown library has an extension called “SmartyPants”. It seems to be emitting named entities like &ldquo;. It doesn’t do apostrophes?

The Markdown-it-py extension has typographic components for quotes and apostrophes. This seems to work well and as a bonus emits actual UTF-8 for the Unicode characters themselves, not ASCII escapes. I’m not afraid of UTF-8. That may be enough to get me to use Markdown-it-py!

One thought on “Typographical quotes in 2023”

  1. Perhaps stick to utf-8 in your blog, use html5?
    If you need character level translation xslt offers that as a single step on output?
