forresto

Reputation: 12397

How can I split a string containing emoji into an array?

I want to take a string of emoji and do something with the individual characters.

In JavaScript "😴😄😃⛔🎠🚓🚇".length == 13 because "⛔" length is 1, the rest are 2. So we can't do

const string = "👨‍👨‍👧‍👧 👦🏾 😴 😄 😃 ⛔ 🎠 🚓 🚇";

const s = string.split(""); 
console.log(s);

const a = Array.from(string);
console.log(a);

Upvotes: 55

Views: 23970

Answers (8)

Ihar Spurhiash

Reputation: 19

You can use Array.from(string) instead of string.split("").
Documentation on MDN

Note that this doesn't work with emoji like 👨‍👨‍👧‍👧 and 👦🏾.

const string = "😴😄😃⛔🎠🚓🚇,👨‍👨‍👧‍👧,👦🏾";
console.log(Array.from(string));
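
To make that limitation concrete, here is a minimal sketch (the variable names are only illustrative, and the emoji are written as escapes so the individual code points are visible). Array.from splits by code point, so a sequence held together by zero-width joiners (U+200D) or carrying a skin-tone modifier still comes apart:

```javascript
// 👨‍👨‍👧‍👧 is four emoji joined by three zero-width joiners (U+200D).
const family = "\u{1F468}\u200D\u{1F468}\u200D\u{1F467}\u200D\u{1F467}";
// 👦🏾 is a boy emoji followed by a skin-tone modifier.
const boy = "\u{1F466}\u{1F3FE}";

// Array.from iterates by code point, not by visible character.
const familyParts = Array.from(family);
const boyParts = Array.from(boy);

console.log(familyParts.length); // 7 (4 emoji + 3 joiners)
console.log(boyParts.length);    // 2 (base emoji + modifier)
```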

Upvotes: 1

rootEnginear

Reputation: 356

With Intl.Segmenter, you can do this:

const splitEmoji = (string) => [...new Intl.Segmenter().segment(string)].map(x => x.segment)

splitEmoji("😴😄😃⛔🎠🚓🚇") // ['😴', '😄', '😃', '⛔', '🎠', '🚓', '🚇']

This also solves the problem with "👨‍👨‍👧‍👧" and "👦🏾".

splitEmoji("👨‍👨‍👧‍👧👦🏾") // ['👨‍👨‍👧‍👧', '👦🏾']

According to CanIUse, this is supported by all modern browsers.

If you need to support older browsers, as mentioned in Matt Davies' answer, Graphemer is the best solution:

let Graphemer = await import("https://cdn.jsdelivr.net/npm/[email protected]/+esm").then(m => m.default.default);
let splitter = new Graphemer();
let graphemes = splitter.splitGraphemes("👨‍👨‍👧‍👧👦🏾"); // ['👨‍👨‍👧‍👧', '👦🏾']
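
If you want a single entry point that copes with both cases, one possible sketch is to feature-detect Intl.Segmenter at runtime. The helper name splitGraphemes below is hypothetical, and the fallback is an assumption on my part: it only splits by code point, so you would swap in a library such as Graphemer there if older browsers need correct results.

```javascript
// Hypothetical helper: prefer Intl.Segmenter, fall back to code points.
function splitGraphemes(str) {
  if (typeof Intl !== "undefined" && typeof Intl.Segmenter === "function") {
    // "grapheme" is the default granularity; it is spelled out for clarity.
    const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
    return Array.from(segmenter.segment(str), (s) => s.segment);
  }
  // Assumed fallback: splits by code point only, so joined emoji come apart.
  return Array.from(str);
}

// Family emoji (joined with U+200D) followed by a boy with a skin-tone modifier.
const sample = "\u{1F468}\u200D\u{1F468}\u200D\u{1F467}\u200D\u{1F467}\u{1F466}\u{1F3FE}";
console.log(splitGraphemes(sample));
```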

Upvotes: 34

Matt Davies

Reputation: 141

The Grapheme Splitter library by Orlin Georgiev is pretty amazing.

However, it hasn't been updated in a while, and at present (Sep 2020) it only supports Unicode 10 and below.

For an updated version of Grapheme Splitter, built in TypeScript with Unicode 13 support, have a look at: https://github.com/flmnt/graphemer

Here is a quick example:

import Graphemer from 'graphemer';

const splitter = new Graphemer();

const string = "😴😄😃⛔🎠🚓🚇";

splitter.countGraphemes(string); // returns 7

splitter.splitGraphemes(string); // returns array of characters

The library also works with the latest emojis.

For example "👩🏻‍🦰".length === 7 but splitter.countGraphemes("👩🏻‍🦰") === 1.

Full disclosure: I created the library and did the work to update to Unicode 13. The API is identical to Grapheme Splitter and is entirely based on that work, just updated to the latest version of Unicode as the original library hasn't been updated for a couple of years and seems to be no longer maintained.

Upvotes: 12

ArtEze

Reputation: 226

It can be done using the u flag of a regular expression. The regular expression is:

/.*?/u

The string is split at every point where this pattern matches zero characters. The characters in between may or may not be emoji, but the pattern cannot match line breaks.

  • Match as few characters as possible, here zero: ? (lazy quantifier)
  • Zero or more: *
  • Any character except a line break: .
  • May or may not be emoji (match by code point): /u

The question mark ? forces a cut at every zero-length match. Without it, /.*/u would consume all characters up to the next line break.

var string = "😴😄😃⛔🎠🚓🚇"
var c = string.split(/.*?/u)
console.log(c)
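
To see what the u flag contributes, here is a small sketch comparing the same split with and without it (the variable names are illustrative, and the emoji are written as escapes). Without the flag, the split advances one UTF-16 code unit at a time and cuts surrogate pairs in half:

```javascript
// 😴😄😃⛔🎠🚓🚇 written as code point escapes.
const str = "\u{1F634}\u{1F604}\u{1F603}\u26D4\u{1F3A0}\u{1F693}\u{1F687}";

// With u, empty matches advance by whole code points.
const withFlag = str.split(/.*?/u);
// Without u, empty matches advance by single UTF-16 code units.
const withoutFlag = str.split(/.*?/);

console.log(withFlag.length);    // 7
console.log(withoutFlag.length); // 13
```

Like Array.from, this works per code point, so emoji built from several code points are still split into their parts.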

Upvotes: 8

Ruben Reyes

Reputation: 783

The modern, proper way to split a string by Unicode code points is to use Array.from(str) instead of str.split('').
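
As a rough illustration of the difference (a sketch only, with illustrative variable names), split('') hands back lone surrogate halves, while Array.from keeps each code point whole:

```javascript
// 😴😄😃⛔🎠🚓🚇 written as code point escapes.
const input = "\u{1F634}\u{1F604}\u{1F603}\u26D4\u{1F3A0}\u{1F693}\u{1F687}";

const units = input.split("");   // one entry per UTF-16 code unit
const points = Array.from(input); // one entry per code point

console.log(units[0] === "\uD83D");     // true: a lone high surrogate
console.log(points[0] === "\u{1F634}"); // true: the whole 😴
```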

Upvotes: 14

forresto

Reputation: 12397

Edit: see Orlin Georgiev's answer for a proper solution in a library: https://github.com/orling/grapheme-splitter


Thanks to this answer I made a function that takes a string and returns an array of emoji:

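var emojiStringToArray = function (str) {
  // Split on surrogate pairs; the capture group keeps each pair in the result.
  var parts = str.split(/([\uD800-\uDBFF][\uDC00-\uDFFF])/);
  var arr = [];
  for (var i = 0; i < parts.length; i++) {
    var ch = parts[i];
    // Skip the empty strings split() leaves between adjacent matches.
    if (ch !== "") {
      arr.push(ch);
    }
  }
  return arr;
};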

So

emojiStringToArray("😴😄😃⛔🎠🚓🚇")
// => Array [ "😴", "😄", "😃", "⛔", "🎠", "🚓", "🚇" ]

Upvotes: 27

Downgoat

Reputation: 14371

JavaScript ES6 has a solution for a real split:

[..."😴😄😃⛔🎠🚓🚇"] // ["😴", "😄", "😃", "⛔", "🎠", "🚓", "🚇"]

Yay? Except for the fact that when you run this through your transpiler, it might not work (see @brainkim's comment). It only works when natively run on an ES6-compliant browser. Luckily this encompasses most browsers (Safari, Chrome, FF), but if you're looking for high browser compatibility this is not the solution for you.
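
For context, the spread works because the string iterator yields whole code points rather than UTF-16 code units. A minimal sketch of the same iteration written out by hand (variable names are illustrative; the emoji are written as escapes):

```javascript
// 😴😄😃⛔🎠🚓🚇 written as code point escapes.
const emoji = "\u{1F634}\u{1F604}\u{1F603}\u26D4\u{1F3A0}\u{1F693}\u{1F687}";

// for...of uses the same string iterator as the spread syntax.
const viaIterator = [];
for (const ch of emoji) {
  viaIterator.push(ch);
}

// Show each entry's code point to confirm nothing was cut in half.
const hex = viaIterator.map((ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase());
console.log(hex[0]); // U+1F634
console.log(hex[3]); // U+26D4
```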

Upvotes: 37

Orlin Georgiev

Reputation: 1481

The grapheme-splitter library does just that. It is fully compatible even with old browsers and works not just with emoji but with all sorts of exotic characters: https://github.com/orling/grapheme-splitter You are likely to miss edge cases in any home-brew solution. This one is actually based on the UAX-29 Unicode standard.

Upvotes: 22
