Unusual rendering and copy-paste for the character 誤

Question

I'm seeing somewhat unusual behavior around the rendering of 誤 in the browser (it reproduces in both Firefox and Chrome), which I'm having trouble explaining.

Specifically, check out the Wiktionary page for 誤:

Notice that there are 3 variations marked in black bold:

1. The top left one has 3 pieces: 言 + ⼝ + 天
2. The middle one kinda' has 4 pieces: 言 + ⼝ + a rotated ꒔ + ⼤
3. The bottom one has 3 pieces: ⻈ + ⼝ + 天

The relation between 2 and 3 is clear: 2 represents the traditional character and 3 represents the simplified character. But what does 1 represent? I've tried the following:

I tried copying character 1 but when I paste it, it ends up looking like character 2.
I tried various font combinations, both in the browser and in TextEdit, but the appearance and copy-pasting behavior persist.

So what is going on with this unusual character rendering and copy-pasting behavior? How can I reproduce character 1 (and not 2) in other applications?

FWIW, when I look at a Chinese dictionary, the stroke order shows character 2 even though the browser renders the character as 1.

Manishearth · Accepted Answer

This is a z-variant, and in this case probably an example of Han unification.

From https://www.zdic.net/hans/%E8%AA%A4:

You can see that the first character (marked as 内地 Mainland China) is what you're getting in the headword.

The headword on Wiktionary is marked up with lang=zh, whereas the example sentences use zh-Hans and zh-Hant respectively. That is the core of this, along with likely-subtags fallback.

Most systems dealing with locales perform locale fallback using likely subtags. So Hans without any region specified typically implies CN, and Hant implies TW during fallback. The reverse is also true: CN implies Hans and TW implies Hant (and some other regions like HK imply Hant as well). Hans/Hant are the script codes for Simplified and Traditional Chinese, and CN/TW are the region codes for China and Taiwan respectively. zh on its own implies zh-CN (and thus zh-Hans-CN).

Fallback also need not always occur the same way: different fonts have different priorities (e.g. a Mainland Chinese font may assume CN by default unless explicitly told otherwise).
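To make the likely-subtags fallback concrete, here is a minimal Python sketch. The table is a small hand-written subset covering only the Chinese tags discussed here, and `parse_tag` and `maximize` are hypothetical helper names, not part of any library. A real system would load the full CLDR likely-subtags data and follow its algorithm more carefully.

```python
# Illustrative subset of likely-subtag data (assumption: a real implementation
# would load the full CLDR table instead of hard-coding entries).
LIKELY_SUBTAGS = {
    "zh": ("zh", "Hans", "CN"),
    "zh-Hans": ("zh", "Hans", "CN"),
    "zh-Hant": ("zh", "Hant", "TW"),
    "zh-CN": ("zh", "Hans", "CN"),
    "zh-TW": ("zh", "Hant", "TW"),
    "zh-HK": ("zh", "Hant", "HK"),
}


def parse_tag(tag):
    """Split a simple language[-Script][-REGION] tag into its three parts."""
    language, script, region = None, None, None
    for i, part in enumerate(tag.split("-")):
        if i == 0:
            language = part.lower()
        elif len(part) == 4 and part.isalpha():
            script = part.title()
        elif len(part) == 2 and part.isalpha():
            region = part.upper()
    return language, script, region


def maximize(tag):
    """Fill in a missing script and/or region (hypothetical helper)."""
    language, script, region = parse_tag(tag)
    # Try the most specific key first, then progressively less specific ones.
    candidates = [
        "-".join(p for p in (language, script, region) if p),
        "-".join(p for p in (language, region) if p),
        "-".join(p for p in (language, script) if p),
        language,
    ]
    for key in candidates:
        if key in LIKELY_SUBTAGS:
            _, likely_script, likely_region = LIKELY_SUBTAGS[key]
            # Subtags given explicitly always win over the likely ones.
            return "-".join(
                (language, script or likely_script, region or likely_region)
            )
    return tag  # language not in the table: leave the tag untouched


for t in ["zh", "zh-Hans", "zh-Hant", "zh-TW", "zh-HK", "zh-Hant-CN", "zh-Hans-TW"]:
    print(t, "->", maximize(t))
```

Explicit subtags are never overridden, which is why zh-Hant-CN and zh-Hans-TW come back unchanged while zh-Hant picks up TW.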

I made a little table and took a screenshot showing how different language tags render on my system when run on Wikipedia (snippet at the bottom of this post).

The font's actually defaulting to Noto Sans CJK JP unless I put it in a class=Hant context (where it switches to Noto Sans CJK TC).

What's happening under the hood is: traditional and simplified forms are not unified in Unicode (they get separate code points), but regional variants like these are (they share a single code point). Even though zh implies zh-Hans-CN, because this is a traditional character, the font will not use the Hans to pick a Simplified character: it must pick a traditional character since Simplified is encoded differently. So you get the Mainland Chinese traditional variant in zh contexts (like the headword), but since zh-Hant implies zh-TW, the font is happy to oblige and give you the Taiwanese (still traditional) variant in the example sentence.
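The encoding side of this can be checked directly. The sketch below uses only the Python standard library; the Simplified counterpart 误 is added here purely for comparison. It prints the code point of each character and shows that 誤 is a single code point whatever glyph a font draws for it, which is also why copying variant 1 and pasting it elsewhere can come out looking like variant 2: the clipboard carries the code point, not the glyph.

```python
import unicodedata
from urllib.parse import quote

traditional = "誤"  # the character from the question
simplified = "误"   # its Simplified counterpart, added here for comparison

# Each character is exactly one code point; the regional glyph shapes are
# chosen by the font and language tag and are not stored in the text.
for ch in (traditional, simplified):
    print(ch, f"U+{ord(ch):04X}", unicodedata.name(ch))

# Traditional and Simplified are distinct code points, so no language tag
# can turn one into the other.
print("same code point:", traditional == simplified)

# Percent-encoded UTF-8 of the traditional character, as seen in the zdic URL.
encoded = quote(traditional)
print(encoded)
```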

Note that not all cases stick to a single font: sometimes the choice of language (or the precise CSS used) can force a different font to be selected. Additionally, you can have z-variants crop up in different contexts without needing to change the language. For example, the Cantonese possessive 嘅 can be built as ⿰口既 or ⿰口旣; the choice is not clearly locale-based and seems to vary freely between fonts.

Code for table above:


 zh 誤
 zh-Hans 誤
 zh-Hant 誤
 zh-CN 誤
 zh-Hant-CN 誤
 zh-Hans-CN 誤
 zh-TW 誤
 zh-HK 誤
 zh-Hans-TW 誤
 ja 誤
 ko 誤
 vi 誤
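The markup of the snippet does not appear to have survived here, only the tag/character pairs. As a rough way to rebuild a comparable test page, the following Python sketch generates an HTML table with one row per tag. It assumes each tag is applied through a per-cell lang attribute, and `build_table` is a hypothetical helper, so this is not necessarily how the original snippet was written.

```python
from html import escape

# Language tags taken from the list above.
LANG_TAGS = [
    "zh", "zh-Hans", "zh-Hant", "zh-CN", "zh-Hant-CN", "zh-Hans-CN",
    "zh-TW", "zh-HK", "zh-Hans-TW", "ja", "ko", "vi",
]


def build_table(char, tags):
    """Return an HTML table with one row per tag (hypothetical helper)."""
    rows = [
        # Assumption: the tag is applied via a lang attribute on the cell.
        f'  <tr><td>{escape(tag)}</td>'
        f'<td lang="{escape(tag)}">{escape(char)}</td></tr>'
        for tag in tags
    ]
    return "<table>\n" + "\n".join(rows) + "\n</table>"


html_table = build_table("誤", LANG_TAGS)
print(html_table)
```

Saving the output to an .html file and opening it in a browser should show how the installed fonts resolve each tag; the result will depend on the fonts available on the system.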
