Clive

Reputation: 435

What is a safe length of JavaScript strings?

Considering charAt(), charCodeAt(), and codePointAt(), I find a discrepancy in what the index parameter means. Before I really thought about it, I assumed you would always be safe to access the character at length-1. But I read that the difference between charCodeAt() and codePointAt() is that charCodeAt() refers to 16-bit units (byte pairs), so besides reading i you would also need i+1 if they formed a surrogate pair (as is the methodology with UTF-16), whereas codePointAt() needs a parameter that references the zero-based position of the whole character (code point). So now I'm in a quandary as to whether length counts the number of characters or the number of byte pairs, UTF-16 style. I believe JavaScript holds strings as UTF-16, but if so, using length-1 with codePointAt() on a string that had lots of 4-byte characters would be off the end of the string!

Upvotes: 5

Views: 559

Answers (2)

Bergi

Reputation: 665430

The length of a string is counted in 16-bit unsigned integer values (“elements”), also known as code units, which together form a valid or invalid UTF-16 code unit sequence. String indices are counted the same way. These units are sometimes loosely called "characters".

It doesn't matter whether you access them as properties or via charAt, charCodeAt and codePointAt: for a non-empty string, length - 1 will always be a valid index. A code point might, however, be encoded as a surrogate pair spanning two indices. There is no built-in method to count code points, but the default string iterator yields them, so you can count them using a for … of loop.
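The counting approach described above could be sketched as follows. This is a minimal illustration, not a built-in: the helper name countCodePoints and the sample string are assumptions made up for this example.

```javascript
// Hypothetical helper: counts code points by walking the default string iterator,
// which yields one whole code point (possibly a surrogate pair) per step.
function countCodePoints(str) {
  let count = 0;
  for (const _ch of str) {
    count++;
  }
  return count;
}

// U+10437 lies outside the Basic Multilingual Plane, so it needs a surrogate pair.
const sample = "a\u{10437}b";

console.log(sample.length);                        // 4 code units
console.log(countCodePoints(sample));              // 3 code points
console.log(sample.charCodeAt(sample.length - 1)); // 98, the code unit for "b"
```

The last line shows that length - 1 is still a valid index even though the string contains a surrogate pair.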

Upvotes: 3

Tatsuyuki Ishi

Reputation: 4031

Use [...str].length to get the number of characters (code points).

var mb = "𐐷";
console.log(mb.length); // 2: length in 16-bit code units
console.log([...mb].length); // 1: "real" length in code points (ES6)
console.log(mb.charAt(0)); // The first two bytes (a lone high surrogate)
console.log(mb.codePointAt(0)); // The four bytes combined (ES6)
console.log(mb.codePointAt(1)); // The second two bytes (ES6)
console.log(mb.charCodeAt(0)); // The first two bytes
console.log(mb.charCodeAt(1)); // The second two bytes
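Building on the snippet above, here is a rough sketch of walking a string one code point at a time while tracking the code-unit index where each one starts. The helper name codePointEntries and the sample string are assumptions for illustration only.

```javascript
// Hypothetical helper: lists each code point with the code-unit index where it starts.
function codePointEntries(str) {
  const entries = [];
  let index = 0;
  while (index < str.length) {
    const codePoint = str.codePointAt(index);
    entries.push({ index, codePoint });
    // Code points above 0xFFFF occupy two code units; everything else occupies one.
    index += codePoint > 0xffff ? 2 : 1;
  }
  return entries;
}

const text = "x\u{10437}"; // "x" followed by U+10437 (one surrogate pair)

console.log(text.length);            // 3 code units
console.log(codePointEntries(text)); // two entries: index 0 (120) and index 1 (66615)

// length - 1 is still in range, but here it lands on the second half of the pair.
console.log(text.codePointAt(text.length - 1).toString(16)); // "dc37"
```

So codePointAt(length - 1) never runs off the end of the string; at worst it returns the trailing (low) surrogate of a pair rather than a full code point.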

Upvotes: 2
