Waruyama

Reputation: 3533

OpenType - Two Khmer chars become three before mapping to glyph Ids

I have an interesting problem with processing Khmer text.

The text "កើ" is a string of length two in Unicode. See the snippet below for the char codes.

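// Grab the input box and the output element from the markup below.
let textbox = document.getElementById('textbox');
let info = document.getElementById('info');

let text = "កើ";

// Show the text in the input box.
textbox.setAttribute('value', text);

// Print the string length and the decimal UTF-16 code of each char.
info.innerHTML = "length: " + text.length + "<br>codes: " + text.split('').map(c => c.charCodeAt(0));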
<input id="textbox" type="text" style="font-size:80px; width: 2em;"/>
<div id="info"></div>

Text renderers seem to compose this text from three glyphs, or to substitute ligatures for them. So far this is exotic but not unexpected.

Here is the puzzling thing: When I type this text into the Crowbar text shaping debugger at http://www.corvelsoftware.co.uk/crowbar/ using the Khmer font from Google Fonts, one can see that the two characters are mapped to three glyphs. But the two characters seem to become three characters even before the mapping. Character 6081 appears out of thin air.
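The char codes in this question are decimal. As a small sketch, this converts them to the usual U+XXXX notation (the helper name `toUPlus` is made up for illustration):

```javascript
// Hypothetical helper: format a decimal code point in U+XXXX notation.
function toUPlus(codePoint) {
  return "U+" + codePoint.toString(16).toUpperCase().padStart(4, "0");
}

// The two char codes of the input string.
console.log([6016, 6078].map(toUPlus)); // [ 'U+1780', 'U+17BE' ]

// The char code that appears unexpectedly.
console.log(toUPlus(6081)); // U+17C1
```

So the input is U+1780 followed by U+17BE, and the extra character is U+17C1.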

I took a deep dive into the internals of the font file, and there is only one subtable in the cmap table, which maps character codes to glyph IDs. This subtable has format 4, which is pretty standard and only allows one-to-one mappings, so no additional glyph is inserted during cmap processing.
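To illustrate why a format 4 subtable cannot insert anything, here is a rough sketch of a format 4 lookup. The segment values are made up for the example and are not read from the Khmer font; `cmap4` and `lookupGlyph` are hypothetical names.

```javascript
// Hypothetical, hand-made format 4 segment data (NOT from the Khmer font).
// Each segment covers startCode[i]..endCode[i]; the last one is the
// mandatory 0xFFFF terminator segment.
const cmap4 = {
  endCode:       [0x17B3, 0x17D3, 0xFFFF],
  startCode:     [0x1780, 0x17B6, 0xFFFF],
  idDelta:       [10 - 0x1780, 100 - 0x17B6, 1],
  idRangeOffset: [0, 0, 0],
  glyphIdArray:  [],
};

// Map exactly one char code to exactly one glyph ID (0 means .notdef).
function lookupGlyph(table, charCode) {
  const segCount = table.endCode.length;
  for (let i = 0; i < segCount; i++) {
    if (table.endCode[i] < charCode) continue;   // segment ends before charCode
    if (table.startCode[i] > charCode) return 0; // charCode falls in a gap
    if (table.idRangeOffset[i] === 0) {
      // Direct case: glyph ID is char code plus delta, modulo 65536.
      return (charCode + table.idDelta[i]) & 0xFFFF;
    }
    // Indirect case: idRangeOffset is a byte offset, measured from its own
    // position in the idRangeOffset array, into glyphIdArray.
    const index = table.idRangeOffset[i] / 2
                + (charCode - table.startCode[i])
                - (segCount - i);
    const glyph = table.glyphIdArray[index];
    return glyph === 0 ? 0 : (glyph + table.idDelta[i]) & 0xFFFF;
  }
  return 0;
}

// Two char codes in, two glyph IDs out. The lookup never adds an entry.
const glyphs = [..."\u1780\u17BE"].map(ch => lookupGlyph(cmap4, ch.codePointAt(0)));
console.log(glyphs); // [ 10, 108 ]
```

Whatever the real segment values are, the lookup takes one char code and returns one glyph ID, so the third character has to come from somewhere else.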

Also, if only the two original char codes are mapped to glyphs, the resulting text will look different, so the third character seems to be necessary.

What step am I missing here that adds the third character before the character-to-glyph-ID mapping? There seems to be some preprocessing of the text taking place that I am not aware of.

Upvotes: 1

Views: 240

Answers (2)

Osify

Reputation: 2295

You may have typed the wrong vowel:

កើ (U+1780 U+17BE) and កេី (U+1780 U+17C1 U+17B8): these two are different.

If you use the wrong vowel key, the glyphs still look the same, but they are built from different key sequences.
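A quick way to check which sequence a string really contains is to list its code points. This is only a sketch, and `codePoints` is a made-up helper name:

```javascript
// The two sequences from above, written with escapes so they cannot be confused.
const oneVowel = "\u1780\u17BE";        // KA + VOWEL SIGN OE
const twoVowels = "\u1780\u17C1\u17B8"; // KA + VOWEL SIGN E + VOWEL SIGN II

// Hypothetical helper: list the code points of a string in U+XXXX notation.
function codePoints(s) {
  return [...s].map(ch =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));
}

console.log(codePoints(oneVowel));  // [ 'U+1780', 'U+17BE' ]
console.log(codePoints(twoVowels)); // [ 'U+1780', 'U+17C1', 'U+17B8' ]

// The strings are not equal, even if they may render alike.
console.log(oneVowel === twoVowels); // false
```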

Try it with this converter: https://r12a.github.io/app-conversion/

Upvotes: 0

Richard Wordingham

Reputation: 11

As @Waruyama suggested, the answer is documented in https://learn.microsoft.com/en-us/typography/script-development/khmer. It is buried in the entry for 'Vowel' in the glossary there, which says:

The shaping engine will take care of pre-pending the syllable, with the glyph piece shaped like U+17C1.

By 'syllable', the specification means the first part of these vowels, which has the same shape as U+17C1 KHMER VOWEL SIGN E. Therefore HarfBuzz expands the input string from <U+1780, U+17BE> to <U+1780, U+17C1, U+17BE>. At this point, the font has been consulted only to confirm that it has GSUB instructions for the Khmer script. Next, it applies the cmap table from the font.
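A minimal sketch of that expansion step follows. It is not HarfBuzz's actual code. The list of vowels that get the U+17C1 piece (U+17BE, U+17BF, U+17C0, U+17C4, U+17C5) is an assumption here, and `expandSplitVowels` is a made-up name.

```javascript
// Assumption: these are the vowels whose first part is shaped like U+17C1.
const SPLIT_VOWELS = new Set([0x17BE, 0x17BF, 0x17C0, 0x17C4, 0x17C5]);
const PRE_BASE_PIECE = 0x17C1; // KHMER VOWEL SIGN E

// Hypothetical pre-cmap step: insert U+17C1 before each split vowel.
// Works on code points only; no font data is involved.
function expandSplitVowels(text) {
  const out = [];
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if (SPLIT_VOWELS.has(cp)) {
      out.push(PRE_BASE_PIECE);
    }
    out.push(cp);
  }
  return out;
}

const expanded = expandSplitVowels("\u1780\u17BE");
console.log(expanded.map(cp => cp.toString(16).toUpperCase()));
// [ '1780', '17C1', '17BE' ]
```

Only after a step like this are the three code points looked up in the cmap table one by one, which matches the three characters shown in Crowbar.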

Upvotes: 1
