Leo Kang

Reputation: 13

UTF-8 encoding confuses me

let buf1 = Buffer.from("3", "utf8");
let buf2 = Buffer.from("Здравствуйте", "utf8");

console.log(buf1);
// <Buffer 33>

console.log(buf2);
// <Buffer d0 97 d0 b4 d1 80 d0 b0 d0 b2 d1 81 d1 82 d0 b2 d1 83 d0 b9 d1 82 d0 b5>

Why does char '3' encode to '33' in buf1 but 'd0 97' in buf2?

Upvotes: 1

Views: 133

Answers (2)

paxdiablo

Reputation: 881113

Because 3 is not З, despite the similarity to the untrained eye. Look closer and you'll see the difference, however subtle.

The former is Unicode code point U+0033 (DIGIT THREE), which UTF-8 encodes as the single byte 33, while the latter is U+0417 (CYRILLIC CAPITAL LETTER ZE), which UTF-8 encodes as the two bytes d0 97.
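To check this yourself, a minimal sketch along these lines should work in Node.js. The helper name describeChar is made up for illustration and is not part of any library; the Cyrillic letter is written as the escape \u0417 so it cannot be mistaken for the digit.

```javascript
// Hypothetical helper (not from the original post): report a character's
// Unicode code point and its UTF-8 bytes as a hex string.
function describeChar(ch) {
  const codePoint = ch.codePointAt(0);
  return {
    codePoint: "U+" + codePoint.toString(16).toUpperCase().padStart(4, "0"),
    utf8: Buffer.from(ch, "utf8").toString("hex"),
  };
}

const digitThree = describeChar("3");      // ASCII digit three
const cyrillicZe = describeChar("\u0417"); // Cyrillic capital letter Ze

console.log(digitThree); // { codePoint: 'U+0033', utf8: '33' }
console.log(cyrillicZe); // { codePoint: 'U+0417', utf8: 'd097' }
```

Code points below U+0080 fit in one UTF-8 byte, which is why the digit stays 33, while U+0417 falls in the two-byte range.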

The Russian word is actually hello, pronounced (very roughly, since I only know hello and goodbye, taught by a Russian girlfriend many decades ago) "Strasvoytza", with no "three" anywhere in the concept.

Upvotes: 2

opr

Reputation: 199

The first character of the second buffer is the Cyrillic character "Ze" (https://en.m.wikipedia.org/wiki/Ze_(Cyrillic)), not the Arabic numeral 3 (https://en.m.wikipedia.org/wiki/3).

Upvotes: 0
