Joji

Reputation: 5625

Why is it that JavaScript strings use UTF-16, but a single character's actual size can be just one byte?

According to this article:

Internally, JavaScript source code is treated as a sequence of UTF-16 code units.

And this IBM doc says that:

UTF-16 is based on 16-bit code units. Therefore, each character can be 16 bits (2 bytes) or 32 bits (4 bytes).

But when I tested in Chrome's console, an English letter only takes 1 byte, not 2 or 4:

new Blob(['a']).size === 1

Why is that the case? Am I missing something here?

Upvotes: 1

Views: 1623

Answers (2)

That_Guy977

Reputation: 111

Internally, JavaScript source code is treated as a sequence of UTF-16 code units.

Note that this is referring to source code, not String values. String values are also described as UTF-16 later in the article:

When a String contains actual textual data, each element is considered to be a single UTF-16 code unit.

The discrepancy here is actually in the Blob constructor. From MDN:

Note that strings here are encoded as UTF-8, unlike the usual JavaScript UTF-16 strings.
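
To see both sizes side by side, here is a small sketch of my own (not from the answer): TextEncoder, like Blob, produces UTF-8 bytes, while String.prototype.length counts UTF-16 code units.

const s = 'aą'
console.log(s.length)                            // 2 — UTF-16 code units
console.log(new Blob([s]).size)                  // 3 — UTF-8 bytes: 'a' is 1, 'ą' is 2
console.log(new TextEncoder().encode(s).length)  // 3 — same UTF-8 byte count as the Blob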

Upvotes: 7

Konrad

Reputation: 24661

UTF-8 is a variable-width encoding, so different characters take different numbers of bytes.

a has a size of 1 byte, but ą, for example, takes 2 bytes:

console.log('a', new Blob(['a']).size) // a 1
console.log('ą', new Blob(['ą']).size) // ą 2
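
For completeness, a sketch of my own (not part of the answer): a character outside the Basic Multilingual Plane, such as '😀', takes four bytes in UTF-8 and two UTF-16 code units in a JavaScript string.

console.log('😀'.length)            // 2 — UTF-16 code units (a surrogate pair)
console.log(new Blob(['😀']).size)  // 4 — UTF-8 bytes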

Upvotes: -1
