Reputation: 5316
For a long time we used a naive approach to split strings into characters in JS:
someString.split('');
But the popularity of emoji forced us to change this approach: emoji (and other non-BMP characters) like 😂 are made of two "characters" (UTF-16 code units).
String.fromCodePoint(128514).split(''); // array of 2 characters; can't embed due to StackOverflow limitations
So what is the modern, correct, and performant approach to this task?
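To make the problem concrete, here is a minimal sketch of what the naive split returns for that code point. The variable names are only for illustration.

```javascript
// U+1F602 lies outside the BMP, so UTF-16 stores it as a surrogate pair.
const emoji = String.fromCodePoint(128514);

console.log(emoji.length); // 2 (UTF-16 code units)

// split('') cuts between code units, leaving two lone surrogates.
const halves = emoji.split('');
console.log(halves.length);          // 2
console.log(halves[0] === '\uD83D'); // true (high surrogate)
console.log(halves[1] === '\uDE02'); // true (low surrogate)

// Joining the halves restores the original character.
console.log(halves.join('') === emoji); // true
```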
Upvotes: 23
Views: 7866
Reputation: 50734
JavaScript has a newer API called Intl.Segmenter (part of the ECMAScript Internationalization API, ECMA-402) that allows you to split strings into graphemes (the user-perceived characters of a string). With this API, your split might look like so:
const split = (str) => {
  const itr = new Intl.Segmenter("en", {granularity: 'grapheme'}).segment(str);
  return Array.from(itr, ({segment}) => segment);
};
// See browser console for output
console.log(split('😂')); // ['😂']
console.log(split('e\u0301')); // ['é'] (e + combining acute accent)
console.log(split('👨‍👩‍👦')); // ['👨‍👩‍👦']
console.log(split('❤️')); // ['❤️']
console.log(split('👱🏽‍♂️')); // ['👱🏽‍♂️']
This lets you handle not only emoji made of two UTF-16 code units, such as 😂, but also composite characters (e.g. é), characters joined by ZWJs (e.g. 👨‍👩‍👦), characters with variation selectors (e.g. ❤️), characters with emoji modifiers (e.g. 👱🏽‍♂️), and so on. None of these can be handled by invoking the string iterator (via spread ..., for...of, Symbol.iterator, etc.), as seen in the other answers, because that only iterates the code points of your string.
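As a minimal sketch, assuming the code may also run where Intl.Segmenter is missing, the split can feature-detect the API and fall back to code point iteration. The helper name splitGraphemes and its default locale are assumptions for illustration, and the fallback path does not group graphemes.

```javascript
// Hypothetical helper: prefers grapheme segmentation, falls back to code points.
function splitGraphemes(str, locale = 'en') {
  if (typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function') {
    const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
    // Each yielded item is an object whose `segment` property holds the text.
    return Array.from(segmenter.segment(str), ({ segment }) => segment);
  }
  // Fallback: the string iterator yields code points, not graphemes.
  return Array.from(str);
}

// 1 when Intl.Segmenter is available, 2 with the code point fallback
console.log(splitGraphemes('e\u0301').length);
```

Creating the segmenter once and reusing it would avoid rebuilding it on every call if the function sits on a hot path.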
Upvotes: 9
Reputation: 11228
I did something like this in a project where I had to support older browsers and an ES5 minifier; it will probably be useful to others:
var array;
if (Array.from && window.Symbol && window.Symbol.iterator) {
  // Iterate by code point using the native string iterator
  array = Array.from(input[window.Symbol.iterator]());
} else {
  // Naive fallback: splits by UTF-16 code unit, use only if that doesn't matter
  array = input.split('');
}
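If the fallback does matter, one option is to pair surrogates by hand. The following is a minimal ES5-only sketch; the function name splitCodePointsES5 is hypothetical, and it splits by code point only, not by grapheme.

```javascript
// Hypothetical ES5-only helper: splits a string into code points by pairing surrogates.
function splitCodePointsES5(input) {
  var result = [];
  for (var i = 0; i < input.length; i++) {
    var unit = input.charCodeAt(i);
    // A high surrogate (0xD800-0xDBFF) followed by a low surrogate (0xDC00-0xDFFF)
    // forms one astral code point, so keep the two code units together.
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < input.length) {
      var next = input.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        result.push(input.charAt(i) + input.charAt(i + 1));
        i++; // skip the low surrogate that was just consumed
        continue;
      }
    }
    // BMP characters and unpaired surrogates are pushed as single units.
    result.push(input.charAt(i));
  }
  return result;
}

console.log(splitCodePointsES5('a\uD83D\uDE02b').length); // 3
```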
Upvotes: 0
Reputation: 1628
const str = "😃😤😸🙀";
console.log([...str]);

function split(str) {
  const arr = [];
  for (const char of str)
    arr.push(char);
  return arr;
}
console.log(split(str));
Upvotes: 25
Reputation: 8091
ECMAScript 2015 introduced the u flag to make regular expressions Unicode-aware. Adding u to your regex returns the complete character in your result.
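const withFlag = `AB😂DE`.match(/./ug);
const withoutFlag = `AB😂DE`.match(/./g);
// withFlag has 5 entries (the emoji stays whole); withoutFlag has 6 (the emoji is split into two surrogates)
console.log(withFlag, withoutFlag);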
There's a little more about it here
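One caveat worth sketching: `.` does not match line terminators, and `match` returns null when nothing matches. A possible wrapper, with the hypothetical name splitWithRegex, uses `[\s\S]` and a default empty array. Like the string iterator, it splits by code point only.

```javascript
// Hypothetical helper: splits a string into code points with a Unicode-aware regex.
// [\s\S] is used instead of . so that line terminators are kept as well.
function splitWithRegex(str) {
  // match() returns null when nothing matches (e.g. for an empty string).
  return str.match(/[\s\S]/gu) || [];
}

const sample = 'A\u{1F602}\nB';
console.log(splitWithRegex(sample).length); // 4
console.log(sample.match(/./gu).length);    // 3, because . skips the newline
console.log(sample.match(/./g).length);     // 4, the emoji is split into two surrogates
```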
Upvotes: 5
Reputation: 5316
The best approach to this task is to use the native String.prototype[Symbol.iterator], which is aware of Unicode code points. Consequently, a clean and easy way to split a string into Unicode characters is to call Array.from on the string, e.g.:
const string = String.fromCodePoint(128514, 32, 105, 32, 102, 101, 101, 108, 32, 128514, 32, 97, 109, 97, 122, 105, 110, 128514);
Array.from(string);
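As a small illustration, the sketch below compares code unit and code point counts for a shorter string and uses the optional mapping argument of Array.from. The variable names are made up for the example.

```javascript
const text = String.fromCodePoint(128514, 32, 105); // emoji, space, "i"

console.log(text.length);             // 4 UTF-16 code units
console.log(Array.from(text).length); // 3 code points

// The optional second argument maps each code point as it is collected.
const codePoints = Array.from(text, (ch) => ch.codePointAt(0));
console.log(codePoints); // [128514, 32, 105]
```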
Upvotes: 10