Korean writing: Hangul and Unicode

CAVEAT LECTOR: these are my notes after 2 hours of reading Wikipedia articles. I’m a rank amateur, in no way is any of this stuff authoritative. Better resources than this blog post: Learn Hangeul, Let’s Learn Hangul.

I’m interested in Korean writing, specifically Hangul. It’s so graphically beautiful! And I love writing systems. It seems possible to learn enough to pronounce words clumsily in just a few hours.

Hangul 한글 is what modern Korean is written in. But historically there were other writing systems. Hanja 한자 / 漢字 is the big one; was used starting around 4th century BCE and is based on Chinese characters. I think it’s a little like how kanji works for Japanese. Hanja is still in occasional use but I think mostly in text describing historical things or some proper nouns. There are some other historical writing systems too, mostly efforts to use Chinese characters to represent Korean word sounds. There are several modern Romanizations including a standard one. Also Kontsevich, a transliteration into the Cyrillic alphabet.

Hangul was invented in the 15th century by King Sejong 조선 세종 (or his advisors) and published in a written Chinese document called the Hunminjeongeum 훈민정음 / 訓民正音. It wasn’t immediately popular, Wikipedia implies the educated classes were snobbish about their classical Chinese writing. Hangul became official right at the end of the 19th century. It’s been slightly reformed over time, with some phonemes being removed or unified as they fell out of use.

Despite looking a little like Chinese on first blush Hangul is actually a phonetic writing system; each square that looks like a character is actually a syllable composed of (usually) three phonemes. In general Korean is a CVC language. A block like 한 (“han”) is really three letters called jamo 자모. So 한 can be broken down into three jamos drawn in one block: ㅎㅏㄴ (H A N). There’s a claim that Hangul is a featural writing system, that the visual shape of a jamo reflects the shape the mouth makes when making the sound. Not sure how accurate that is.

Jamos are pretty much never written alone. Instead they are combined into a square block with a fairly straightforward algorithm nicely visualized on Wikipedia. The last consonant is at the bottom. The first consonant is either to the left or above the vowel. It gets a little more complicated because some letters have complex forms.

One extra detail: at least in Unicode, jamos come in both leading (choseong) form like ᄀ and trailing (jongseong) form like ᆨ. The shapes look the same but in this font I’m using they’re shown at different heights. I’m not sure why leading vs. trailing is encoded differently or whether Korean native speakers think of them as different things; since jamos don’t normally appear alone it’s a pretty academic point.

English speakers should be aware that Korean phonology is quite different from English or similar European languages. A “t” sound could be a simple t ㄷ, a tense t͈ ㄸ, or an aspirated tʰ ㅌ. These variants are written as different jamos and are basically entirely different consonants even if they sound kind of the same to an English speaker’s ear. Modern Korean is not tonal but historically it was. Hangul used to use diacritics to indicate tone.

Korean in Unicode is a bit of a mess. My understanding is almost all systems encode Korean with Hangul Syllables, characters in the range U+AC00 to U+D7AF. Each code point is a single syllable. Use those and it’s pretty simple, there’s just a lot of them. The odd thing is these 11,184 code points are entirely algorithmically derivable as combinations of the individual jamos that make up the syllables. The graphical combination is also algorithmic. Instead of thousands of syllable code points you could just write Korean unambiguously as strings of 67 different jamos, they are in Unicode (U+1100 to U+11FF). But that’s not what people do in practice (see the road not taken). I think the only time you’re likely to see the jamo form is if you use normalize and decompose input into Unicode NFD.

A second historical complication of Korean Unicode is mentioned in a tart comment in the UTF-8 RFC

Amendment 5 to ISO/IEC 10646, however, has moved and expanded the Korean Hangul block, thereby making any previous data containing Hangul characters invalid under the new version.  Unicode 2.0 has the same difference from Unicode 1.1. The official justification for allowing such an incompatible change was that no implementations and no data containing Hangul existed, a statement that is likely to be true but remains unprovable.  The incident has been dubbed the “Korean mess”, and the relevant committees have pledged to never, ever again make such an incompatible change.

Mentioning this mostly because it’s funny, doesn’t seem relevant to contemporary systems.

Next thing for me is I’d like to learn more about Korean calligraphy and font design. How Korean artists think about their own writing and manipulate it.