Ginden

Reputation: 5316

How to split Unicode string to characters in JavaScript

For a long time, we used a naive approach to split strings in JS:

someString.split('');

But the popularity of emoji forced us to change this approach: emoji (and other non-BMP characters) like 😂 are made of two "characters" (UTF-16 code units).

String.fromCodePoint(128514).split(''); // array of 2 characters; can't embed due to StackOverflow limitations
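
To make the problem concrete, here is a minimal sketch of what `split('')` returns for that code point. The escape sequences are used only so the snippet does not depend on how the page renders emoji; the variable names are illustrative.

```javascript
// U+1F602 (decimal 128514) lies outside the BMP, so UTF-16 stores it
// as a surrogate pair: a high surrogate followed by a low surrogate.
const emoji = String.fromCodePoint(128514);

const units = emoji.split('');
console.log(emoji.length); // 2 (UTF-16 code units, not characters)
console.log(units.length); // 2
console.log(units[0] === '\uD83D', units[1] === '\uDE02'); // true true

// Neither half is a valid character on its own.
console.log(units[0].codePointAt(0).toString(16)); // d83d
```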

So what is the modern, correct, and performant approach to this task?

Upvotes: 23

Views: 7866

Answers (5)

Nick Parsons

Reputation: 50734

JavaScript has a newer API called Intl.Segmenter, part of the ECMAScript Internationalization API (ECMA-402), that allows you to split strings based on graphemes (the user-perceived characters of a string). With this API, your split might look like this:

const split = (str) => {
  const itr = new Intl.Segmenter("en", {granularity: 'grapheme'}).segment(str);
  return Array.from(itr, ({segment}) => segment);
}
// See browser console for output
console.log(split('😂')); // ['😂']
console.log(split('é')); // ['é']
console.log(split('👨‍👩‍👦')); // ['👨‍👩‍👦']
console.log(split('❤️')); // ['❤️']
console.log(split('👱🏽‍♀️')); // ['👱🏽‍♀️']

This allows you to deal not only with emoji consisting of two code units such as 😂, but also with other characters such as composite characters (e.g. é), characters joined by ZWJs (e.g. 👨‍👩‍👦), characters with variation selectors (e.g. ❤️), and characters with emoji modifiers (e.g. 👱🏽‍♀️). None of these can be handled by invoking the string iterator (spread ..., for...of, Symbol.iterator, etc.) as in the other answers, because that only iterates the code points of your string.
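
As a rough illustration of that difference, the sketch below splits the same ZWJ family emoji both ways. It assumes a runtime that ships Intl.Segmenter; the variable names are made up for the example, and the emoji is written with escapes so the code points are visible.

```javascript
// Family emoji: man + ZWJ + woman + ZWJ + boy (five code points, one grapheme).
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F466}';

// The string iterator walks code points, so the ZWJ sequence falls apart.
const codePoints = [...family];

// Intl.Segmenter groups code points into user-perceived characters.
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const graphemes = Array.from(segmenter.segment(family), ({ segment }) => segment);

console.log(family.length);     // 8 UTF-16 code units
console.log(codePoints.length); // 5 code points
console.log(graphemes.length);  // 1 grapheme
```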

Upvotes: 9

Ebrahim Byagowi

Reputation: 11228

I did something like this in a project where I had to support older browsers and an ES5 minifier; it will probably be useful to others:

    if (Array.from && window.Symbol && window.Symbol.iterator) {
        array = Array.from(input[window.Symbol.iterator]());
    } else {
        array = ...; // maybe `input.split('');` as fallback if it doesn't matter
    }
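
If the fallback does need to keep surrogate pairs together, one possible ES5-only option is a regular expression that matches a pair before it matches a single code unit. This is a sketch of a commonly used technique, not the fallback used in the answer above, and the helper name `splitCodePoints` is hypothetical.

```javascript
// Hypothetical ES5-compatible helper (the name is illustrative).
function splitCodePoints(input) {
    // Match a high surrogate followed by a low surrogate as one unit,
    // otherwise match any single UTF-16 code unit (including newlines).
    var matches = input.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]|[\s\S]/g);
    // String.prototype.match returns null when nothing matches (empty input).
    return matches || [];
}

console.log(splitCodePoints('a\uD83D\uDE02b').length); // 3
```

Like the string iterator, this only keeps code points intact; it does not group ZWJ sequences or combining marks into graphemes.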

Upvotes: 0

Omkar76

Reputation: 1628

Using spread in an array literal:

const str = "🌍🤖😸🎉";
console.log([...str]);

Using for...of:

function split(str) {
  const arr = [];
  for (const char of str) {
    arr.push(char);
  }
  return arr;
}

const str = "🌍🤖😸🎉";
console.log(split(str));

Upvotes: 25

robstarbuck

Reputation: 8091

ECMAScript 2015 introduced the u flag, which makes regular expressions Unicode-aware.

With the u flag, . matches a whole code point rather than a single UTF-16 code unit, so each match in the result is a complete character.

const withFlag = `AB😂DE`.match(/./ug);
const withoutFlag = `AB😂DE`.match(/./g);

console.log(withFlag, withoutFlag);

There's a little more about it here
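
One caveat worth sketching: by default . does not match line terminators, so a string containing newlines loses them with this pattern. Assuming the runtime supports the s (dotAll) flag from ES2018, something like the following keeps them; the variable names are illustrative.

```javascript
const text = 'AB\u{1F602}\nDE';

// Without the s flag, "." skips line terminators, so the "\n" is dropped.
const withoutDotAll = text.match(/./gu) || [];

// With the s (dotAll) flag, every code point is returned.
const withDotAll = text.match(/./gsu) || [];

console.log(withoutDotAll.length); // 5
console.log(withDotAll.length);    // 6
```

The `|| []` guards against `match` returning null for an empty string.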

Upvotes: 5

Ginden

Reputation: 5316

The best approach to this task is to use the native String.prototype[Symbol.iterator], which is aware of Unicode code points. Consequently, a clean and easy way to split a string into Unicode characters is to call Array.from on it, e.g.:

const string = String.fromCodePoint(128514, 32, 105, 32, 102, 101, 101, 108, 32, 128514, 32, 97, 109, 97, 122, 105, 110, 128514);
Array.from(string);
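
A small sketch of what this gives you, using a shorter string built the same way. The names `codes`, `sample`, `chars`, and `roundTrip` are made up for the example.

```javascript
const codes = [128514, 32, 105, 32, 102, 101, 101, 108, 32, 128514];
const sample = String.fromCodePoint(...codes);

const chars = Array.from(sample);
console.log(sample.length); // 12 UTF-16 code units
console.log(chars.length);  // 10 code points

// Array.from accepts a mapping function as its second argument,
// so splitting and converting can happen in one pass.
const roundTrip = Array.from(sample, (ch) => ch.codePointAt(0));
console.log(roundTrip.every((cp, i) => cp === codes[i])); // true
```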

Upvotes: 10
