Val Ruiz

Reputation: 85

How to use substring with special unicode characters?

var string = "abc𝑚";
var lastchar = string.substr(string.length - 1);
console.log(lastchar);

This returns ? instead of 𝑚

Upvotes: 6

Views: 1560

Answers (1)

T.J. Crowder

Reputation: 1074335

In JavaScript, a string is a series of UTF-16 code units (details in my blog post What is a string?). In UTF-16, that last glyph (loosely, "character") requires two code units (which combine to make a single code point), so your string's length is 5, and taking the last code unit on its own gives you only half of the pair.
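To make the code unit / code point distinction concrete, here is a minimal sketch that inspects the question's string. It only uses built-in string methods; the variable names are illustrative.

```javascript
const s = "abc𝑚";

// .length counts UTF-16 code units, not code points.
const unitCount = s.length;

// Spreading iterates by code point, so the surrogate pair counts once.
const codePointCount = [...s].length;

// The last code unit on its own is the trailing (low) half of a surrogate pair,
// which is why substr(length - 1) prints a replacement character.
const lastUnit = s.charCodeAt(s.length - 1);
const isLowSurrogate = lastUnit >= 0xdc00 && lastUnit <= 0xdfff;

// codePointAt reads the whole pair when given the index of the leading half.
const lastCodePoint = s.codePointAt(s.length - 2);

console.log(unitCount, codePointCount); // 5 4
console.log(isLowSurrogate);            // true
console.log(lastCodePoint > 0xffff);    // true
```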

Until ES2015 there wasn't much built into JavaScript to help you with this. But when iterability was introduced, strings were made iterable, and they iterate over their code points, not their code units. Spread operations use iteration, so you can spread that string out into an array to get at its code points:

const string = "abc𝑚";
console.log(string.length); // 5
const chars = [...string];
console.log(chars.length);  // 4
const lastchar = chars.slice(chars.length - 1).join("");
console.log(lastchar);      // 𝑚

That's just an example to demonstrate the distinction and how you can use code points fairly easily.
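If you need this in more than one place, the same idea can be wrapped in a small function. This is only a sketch: lastCodePoints is a hypothetical helper name, not a built-in, and it builds a full array of the string's code points, which is fine for short strings but wasteful for very long ones.

```javascript
// Hypothetical helper: returns the last `count` code points of `str` as a string.
function lastCodePoints(str, count = 1) {
  // slice(-0) would return the whole array, so handle non-positive counts explicitly.
  if (count <= 0) return "";
  // Array.from uses string iteration, so surrogate pairs stay together.
  return Array.from(str).slice(-count).join("");
}

console.log(lastCodePoints("abc𝑚"));    // 𝑚
console.log(lastCodePoints("abc𝑚", 2)); // c𝑚
console.log(lastCodePoints("abc"));      // c
```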

Even code points aren't necessarily glyphs, because some code points combine with others to form a single glyph. (For instance, in Devanagari, the word for the script is "देवनागरी", which looks like five glyphs to native readers but is eight code points, because some of the glyphs are written as a base syllable followed by a vowel-sign code point that modifies it.) Intl.Segmenter, which was still under development when this was first written and is now available in current browsers and Node.js, helps with those situations as well.
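For the combining case, here is a minimal sketch using Intl.Segmenter with grapheme granularity. It assumes a runtime that ships Intl.Segmenter with full ICU data; graphemes is a hypothetical helper name and the "en" default locale is just a placeholder.

```javascript
// Hypothetical helper: splits a string into user-perceived characters (grapheme clusters).
function graphemes(str, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  // segment() returns an iterable of { segment, index, input } objects.
  return Array.from(segmenter.segment(str), (part) => part.segment);
}

const word = "देवनागरी";
console.log([...word].length);        // 8 code points
console.log(graphemes(word).length);  // 5 grapheme clusters

// Last user-perceived character of the question's string.
const parts = graphemes("abc𝑚");
console.log(parts[parts.length - 1]); // 𝑚
```

The same slice-and-join approach used for code points works on the array this returns, so taking the last N user-perceived characters is a small change.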

Upvotes: 8
