typesanitizer

Reputation: 2775

Unusual rendering and copy-paste for the character 誤

I'm seeing somewhat unusual behavior around the rendering of 誤 in the browser (it happens in both Firefox and Chrome), which I'm having trouble explaining.

Specifically, check out the Wiktionary page for 誤:

Screenshot of Wiktionary showing variations of 誤

Notice that there are 3 variations marked in black bold:

  1. The top left one has 3 pieces: 言 + ⼝ + 天
  2. The middle one kinda has 4 pieces: 言 + ⼝ + a rotated ꒔ + ⼤
  3. The bottom one has 3 pieces: ⻈ + ⼝ + 天

The relation between 2 and 3 is clear: 2 represents the traditional character and 3 represents the simplified character. But what does 1 represent? I've tried the following:

So what is going on with this unusual character rendering and copy-pasting behavior? How can I reproduce character 1 (and not 2) in other applications?

FWIW, when I look at a Chinese dictionary, the stroke order shows character 2 even though the browser renders the character as 1.

MDBG screenshot for 誤

Upvotes: 2

Views: 225

Answers (2)

Manishearth

Reputation: 16198

This is a z-variant, and in this case probably an example of Han unification.

From https://www.zdic.net/hans/%E8%AA%A4:

(Screenshot from zdic.net showing the regional variants of 誤)

You can see that the first character (marked as 内地 Mainland China) is what you're getting in the headword.

The headword on Wiktionary is marked up with lang=zh, whereas the example sentences use zh-Hans and zh-Hant respectively. That difference, combined with likely-subtags fallback, is the core of this.

Most systems dealing with locales perform locale fallback using likely subtags. Hans and Hant are the script codes for Simplified and Traditional Chinese, and CN and TW are the region codes for China and Taiwan. During fallback, Hans without a region typically implies CN, and Hant implies TW. The reverse is also true: CN implies Hans and TW implies Hant (some other regions, like HK, imply Hant as well). zh on its own implies zh-CN (and thus zh-Hans-CN).
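The fallback described above can be sketched in a few lines of Python. This is only an illustration: the `LIKELY` table is a small hand-copied excerpt covering the tags mentioned in this answer, `parse` and `maximize` are hypothetical helper names, and the lookup order is simplified. A real implementation would load the full likely-subtags data from CLDR.

```python
# Hand-copied excerpt of likely-subtag rules for Chinese (assumption:
# real software loads the complete CLDR data instead of a table like this).
LIKELY = {
    ("zh", None, None): ("zh", "Hans", "CN"),
    ("zh", "Hans", None): ("zh", "Hans", "CN"),
    ("zh", "Hant", None): ("zh", "Hant", "TW"),
    ("zh", None, "CN"): ("zh", "Hans", "CN"),
    ("zh", None, "TW"): ("zh", "Hant", "TW"),
    ("zh", None, "HK"): ("zh", "Hant", "HK"),
}


def parse(tag):
    """Split a simple tag such as 'zh-Hant-TW' into (language, script, region)."""
    parts = tag.split("-")
    language, script, region = parts[0].lower(), None, None
    for part in parts[1:]:
        if len(part) == 4:      # four letters: script subtag (Hans, Hant)
            script = part.title()
        elif len(part) == 2:    # two letters: region subtag (CN, TW, HK)
            region = part.upper()
    return language, script, region


def maximize(tag):
    """Fill in the missing script and/or region with the likely ones."""
    language, script, region = parse(tag)
    # Try the most specific key first, then progressively less specific ones.
    candidates = (
        (language, script, region),
        (language, script, None),
        (language, None, region),
        (language, None, None),
    )
    for key in candidates:
        if key in LIKELY:
            _, likely_script, likely_region = LIKELY[key]
            # Subtags that were given explicitly always win.
            return "-".join((language, script or likely_script, region or likely_region))
    return tag  # no rule in this excerpt: leave the tag unchanged


for tag in ("zh", "zh-Hant", "zh-TW", "zh-HK", "zh-Hans-TW"):
    print(tag, "->", maximize(tag))
```

Note that explicitly given subtags are never overwritten, so an unusual combination like zh-Hans-TW stays as it is.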

Fallback also need not always occur the same way: different fonts have different priorities (e.g. a Mainland Chinese font may assume CN by default unless explicitly told otherwise).

I made a little table; the screenshot shows how different language tags render on my system when the table is run on Wikipedia (snippet at the bottom of this post).

(Screenshot of the table as rendered)

The font's actually defaulting to Noto Sans CJK JP unless I put it in a class=Hant context (where it switches to Noto Sans CJK TC).

What's happening under the hood is this: Unicode encodes traditional and simplified forms as separate characters, but it unifies regional glyph variants like these under a single code point. Even though zh implies zh-Hans-CN, 誤 is a traditional character, so the font cannot use Hans to pick a simplified form: the simplified character is encoded separately. So you get the Mainland Chinese variant of the traditional character in zh contexts (like the headword). Since zh-Hant implies zh-TW, the font is happy to oblige and give you the Taiwanese (still traditional) variant in the example sentence.
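A quick way to check the encoding side of this is to look at the code points directly. The Python sketch below (variable names are just for illustration) decodes the percent-encoded character from the zdic.net URL above and compares it with the simplified form. Whichever regional glyph a font draws, the traditional character is the same single code point, so the string alone cannot say which variant to show.

```python
import unicodedata
from urllib.parse import unquote

traditional = unquote("%E8%AA%A4")  # the character from the zdic.net URL: 誤
simplified = "误"                    # its Simplified Chinese counterpart

for ch in (traditional, simplified):
    # Print the code point and the Unicode character name.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Traditional and simplified are separate characters in Unicode...
print(traditional == simplified)
# ...whereas the Mainland and Taiwanese shapes of 誤 share this one code point.
print(len(traditional), hex(ord(traditional)))
```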

Note that not all cases stick to a single font: sometimes the choice of language (or the precise CSS used) can force a different font to be selected. Additionally, z-variants can crop up in different contexts without any change of language. For example, the Cantonese possessive 嘅 can be built as ⿰口既 or ⿰口旣; the choice is not clearly locale-based and seems to vary freely between fonts.


Code for table above:

<table>
  <!-- Each row sets its own lang, so each cell shows 誤 as rendered for that tag. -->
  <tr lang="zh"><td>zh</td><td>誤</td></tr>
  <tr lang="zh-Hans"><td>zh-Hans</td><td>誤</td></tr>
  <tr lang="zh-Hant"><td>zh-Hant</td><td>誤</td></tr>
  <tr lang="zh-CN"><td>zh-CN</td><td>誤</td></tr>
  <tr lang="zh-Hant-CN"><td>zh-Hant-CN</td><td>誤</td></tr>
  <tr lang="zh-Hans-CN"><td>zh-Hans-CN</td><td>誤</td></tr>
  <tr lang="zh-TW"><td>zh-TW</td><td>誤</td></tr>
  <tr lang="zh-HK"><td>zh-HK</td><td>誤</td></tr>
  <tr lang="zh-Hans-TW"><td>zh-Hans-TW</td><td>誤</td></tr>
  <tr lang="ja"><td>ja</td><td>誤</td></tr>
  <tr lang="ko"><td>ko</td><td>誤</td></tr>
  <tr lang="vi"><td>vi</td><td>誤</td></tr>
</table>

Upvotes: 1

typesanitizer

Reputation: 2775

(Based on a Twitter discussion with manishearth)

The difference comes from variations across fonts (called z-variants). Specifically, based on the language tag, the browser can resolve the same generic font family (e.g. sans-serif) to different fonts. For example, on my device:

  • With lang="zh", the browser picks PingFang SC from sans-serif.
  • With lang="zh-Hant", the browser picks PingFang TC from sans-serif.

PingFang TC

PingFang SC

These two fonts render the character differently. The lang attribute differs between parts of the HTML, which leads to different font selection and hence different rendering.
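To see which lang value applies to each piece of text on a page, one option is to walk the markup and track the inherited attribute. The sketch below uses Python's standard `html.parser`. The `LangTracker` class and the `SAMPLE` markup are hypothetical, loosely modelled on a dictionary entry rather than copied from any real page, and the tracker assumes every element is explicitly closed.

```python
from html.parser import HTMLParser


class LangTracker(HTMLParser):
    """Record each non-blank text node with the lang value in effect for it."""

    def __init__(self):
        super().__init__()
        self.stack = [None]  # effective lang for each currently open element
        self.found = []      # (effective lang, text) pairs

    def handle_starttag(self, tag, attrs):
        # An element without its own lang inherits the value from its parent.
        # Limitation: unclosed void elements such as <br> would skew the stack.
        self.stack.append(dict(attrs).get("lang", self.stack[-1]))

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.found.append((self.stack[-1], data.strip()))


# Hypothetical markup: a headword in lang="zh" plus two example lines.
SAMPLE = (
    '<div lang="zh"><b>誤</b>'
    '<p lang="zh-Hant">誤</p>'
    '<p lang="zh-Hans">误</p></div>'
)

tracker = LangTracker()
tracker.feed(SAMPLE)
tracker.close()

for lang, text in tracker.found:
    print(lang, text)
```

The same character can therefore end up with different effective language tags within one page, and the browser may pick a different font for each.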

Outside the browser, the variant that gets rendered can also change depending on the language context. The Han unification page on Wikipedia has more discussion of this, with examples.

Upvotes: 0
