Reputation: 114138
Note: this question could look odd on systems not supporting the included emoji.
This is a follow-up question to How do I remove emoji from string.
I want to build a regular expression that matches all emoji that can be entered in Mac OS X / iOS.
The obvious Unicode blocks cover most, but not all of these emoji:
Wikipedia provides a compiled list of all the symbols available in Apple Color Emoji on OS X Mountain Lion and iOS 6, which looks like a good starting point: (slightly updated)
people = '๐๐๐๐โบ๏ธ๐๐๐๐๐๐๐๐๐๐ณ๐๐๐๐๐๐ฃ๐ข๐๐ญ๐ช๐ฅ๐ฐ๐
๐๐ฉ๐ซ๐จ๐ฑ๐ ๐ก๐ค๐๐๐๐ท๐๐ด๐ต๐ฒ๐๐ฆ๐ง๐๐ฟ๐ฎ๐ฌ๐๐๐ฏ๐ถ๐๐๐๐ฒ๐ณ๐ฎ๐ท๐๐ถ๐ฆ๐ง๐จ๐ฉ๐ด๐ต๐ฑ๐ผ๐ธ๐บ๐ธ๐ป๐ฝ๐ผ๐๐ฟ๐น๐พ๐น๐บ๐๐๐๐๐ฝ๐ฉ๐ฅโจ๐๐ซ๐ฅ๐ข๐ฆ๐ง๐ค๐จ๐๐๐๐
๐๐๐๐๐โโ๐โ๐๐๐๐๐๐๐โ๐๐ช๐ถ๐๐๐ซ๐ช๐ฌ๐ญ๐๐๐ฏ๐๐
๐๐๐๐๐
๐ฐ๐๐๐๐ฉ๐๐๐๐๐ก๐ ๐ข๐๐๐๐๐ฝ๐๐๐๐ผ๐๐๐๐๐๐๐๐๐๐๐โค๐๐๐๐๐๐๐๐๐๐๐๐ค๐ฅ๐ฌ๐ฃ๐ญ'
nature = '๐ถ๐บ๐ฑ๐ญ๐น๐ฐ๐ธ๐ฏ๐จ๐ป๐ท๐ฝ๐ฎ๐๐ต๐๐ด๐๐๐ผ๐ง๐ฆ๐ค๐ฅ๐ฃ๐๐๐ข๐๐๐๐๐๐๐๐ ๐๐ฌ๐ณ๐๐๐๐๐๐
๐๐๐๐๐๐๐๐๐๐ฒ๐ก๐๐ซ๐ช๐๐๐ฉ๐พ๐๐ธ๐ท๐๐น๐ป๐บ๐๐๐๐ฟ๐พ๐๐ต๐ด๐ฒ๐ณ๐ฐ๐ฑ๐ผ๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐ โญโโ
โโกโโโ๐๐๐๐'
objects = '๐๐๐๐๐๐๐๐๐๐๐๐ป๐
๐๐๐๐๐๐๐๐ฎ๐ฅ๐ท๐น๐ผ๐ฟ๐๐ฝ๐พ๐ป๐ฑโ๐๐๐ ๐ก๐บ๐ป๐๐๐๐๐๐๐ข๐ฃโณโโฐโ๐๐๐๐๐๐๐ก๐ฆ๐๐
๐๐๐๐๐๐ฟ๐ฝ๐ง๐ฉ๐จ๐ช๐ฌ๐ฃ๐ซ๐ช๐๐๐ฐ๐ด๐ต๐ท๐ถ๐ณ๐ธ๐ฒ๐ง๐ฅ๐คโ๐ฉ๐จ๐ฏ๐ซ๐ช๐ฌ๐ญ๐ฎ๐ฆ๐๐๐๐๐๐๐๐๐๐
๐๐๐๐โ๐๐โโ๐๐๐๐๐๐๐๐๐๐๐๐๐๐ฌ๐ญ๐ฐ๐จ๐ฌ๐ค๐ง๐ผ๐ต๐ถ๐น๐ป๐บ๐ท๐ธ๐พ๐ฎ๐๐ด๐๐ฒ๐ฏ๐๐โฝโพ๐พ๐ฑ๐๐ณโณ๐ต๐ด๐๐๐๐ฟ๐๐๐๐ฃโ๐ต๐ถ๐ผ๐บ๐ป๐ธ๐น๐ท๐ด๐๐๐๐๐๐๐๐ค๐ฑ๐ฃ๐ฅ๐๐๐๐๐ฒ๐ข๐ก๐ณ๐๐ฉ๐ฎ๐ฆ๐จ๐ง๐๐ฐ๐ช๐ซ๐ฌ๐ญ๐ฏ๐๐๐๐๐๐๐๐๐๐๐๐๐๐ ๐๐
๐ฝ'
places = '๐ ๐ก๐ซ๐ข๐ฃ๐ฅ๐ฆ๐ช๐ฉ๐จ๐โช๐ฌ๐ค๐๐๐ฏ๐ฐโบ๐ญ๐ผ๐พ๐ป๐๐
๐๐ฝ๐๐ ๐กโฒ๐ข๐ขโต๐ค๐ฃโ๐โ๐บ๐๐๐๐๐๐๐๐
๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐๐จ๐๐๐๐๐๐ฒ๐ก๐๐ ๐๐๐๐ซ๐ฆ๐ฅโ ๐ง๐ฐโฝ๐ฎ๐ฐโจ๐ฟ๐ช๐ญ๐๐ฉ๐ฏ๐ต๐ฐ๐ท๐ฉ๐ช๐จ๐ณ๐บ๐ธ๐ซ๐ท๐ช๐ธ๐ฎ๐น๐ท๐บ๐ฌ๐ง'
symbols = '1๏ธโฃ2๏ธโฃ3๏ธโฃ4๏ธโฃ5๏ธโฃ6๏ธโฃ7๏ธโฃ8๏ธโฃ9๏ธโฃ0๏ธโฃ๐๐ข#๏ธโฃ๐ฃโฌ๏ธโฌ๏ธโฌ
๏ธโก๏ธ๐ ๐ก๐คโ๏ธโ๏ธโ๏ธโ๏ธโ๏ธโ๏ธ๐โ๏ธโถ๏ธ๐ผ๐ฝโฉ๏ธโช๏ธโน๏ธโชโฉโซโฌโคต๏ธโคด๏ธ๐๐๐๐๐๐๐๐๐๐ถ๐ฆ๐๐ฏ๐ณ๐ต๐ด๐ฒ๐๐น๐บ๐ถ๐๐ป๐น๐บ๐ผ๐พ๐ฐ๐ฎ๐
ฟ๏ธโฟ๏ธ๐ญ๐ท๐ธ๐โ๏ธ๐๐๐
๐๐ใ๏ธใ๏ธ๐๐๐๐ซ๐๐ต๐ฏ๐ฑ๐ณ๐ท๐ธโโณ๏ธโ๏ธโโ
โด๏ธ๐๐๐ณ๐ด๐
ฐ๐
ฑ๐๐
พ๐ โฟโป๏ธโ๏ธโ๏ธโ๏ธโ๏ธโ๏ธโ๏ธโ๏ธโ๏ธโ๏ธโ๏ธโ๏ธโ๏ธโ๐ฏ๐ง๐น๐ฒ๐ฑยฉ๏ธยฎ๏ธโข๏ธโโผ๏ธโ๏ธโโโโโญ๐๐๐๐๐๐๐๐ง๐๐๐๐๐๐๐๐๐๐ ๐๐๐๐๐๐๐ก๐ข๐ฃ๐ค๐ฅ๐ฆโ๏ธโโโโ โฅโฃโฆ๐ฎ๐ฏโโ๐๐โฐใฐใฝ๏ธ๐ฑโผ๏ธโป๏ธโพ๏ธโฝ๏ธโช๏ธโซ๏ธ๐บ๐ฒ๐ณโซ๏ธโช๏ธ๐ด๐ต๐ปโฌ๏ธโฌ๏ธ๐ถ๐ท๐ธ๐น'
emoji = people + nature + objects + places + symbols # all emoji combined
Most characters have a single code point and converting these would be easy:
But some characters are "encoded using two Unicode values":
And some even have three code points:
(Variation Selector 16 means "emoji style")
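To make the three cases concrete, here is a minimal Python 3 sketch that prints the code points of one sample from each case. The samples (a grinning face, a flag built from two regional indicator symbols, and the keycap 1) are illustrative picks written as escapes, and `code_points` is a hypothetical helper name.

```python
# Sample emoji written as escapes so the example does not depend on font support.
samples = {
    "grinning face": "\U0001F600",          # one code point
    "US flag": "\U0001F1FA\U0001F1F8",      # two regional indicator symbols
    "keycap 1": "1\uFE0F\u20E3",            # digit + Variation Selector 16 + keycap
}


def code_points(text):
    """Return the code points of `text` formatted as U+XXXX strings."""
    # A Python 3 str iterates by code point, not by UTF-16 code unit.
    return ["U+%04X" % ord(ch) for ch in text]


for label, text in samples.items():
    print(label, code_points(text))
```

In the keycap sample the middle code point, U+FE0F, is the Variation Selector 16 mentioned above.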
How can I split this list into characters (without splitting combined characters), find their code point(s) and finally build a regular expression matching them?
The regex doesn't have to respect "missing" characters within larger blocks, i.e. it's okay if the 4 Unicode blocks mentioned above are entirely covered.
(I'm going to answer this myself if I don't get any answers, but maybe there's an easy solution)
Upvotes: 9
Views: 5915
Reputation: 1252
This regex matches all 845 emoji, taken from Emoji unicode characters for use on the web:
[\u{203C}\u{2049}\u{20E3}\u{2122}\u{2139}\u{2194}-\u{2199}\u{21A9}-\u{21AA}\u{231A}-\u{231B}\u{23E9}-\u{23EC}\u{23F0}\u{23F3}\u{24C2}\u{25AA}-\u{25AB}\u{25B6}\u{25C0}\u{25FB}-\u{25FE}\u{2600}-\u{2601}\u{260E}\u{2611}\u{2614}-\u{2615}\u{261D}\u{263A}\u{2648}-\u{2653}\u{2660}\u{2663}\u{2665}-\u{2666}\u{2668}\u{267B}\u{267F}\u{2693}\u{26A0}-\u{26A1}\u{26AA}-\u{26AB}\u{26BD}-\u{26BE}\u{26C4}-\u{26C5}\u{26CE}\u{26D4}\u{26EA}\u{26F2}-\u{26F3}\u{26F5}\u{26FA}\u{26FD}\u{2702}\u{2705}\u{2708}-\u{270C}\u{270F}\u{2712}\u{2714}\u{2716}\u{2728}\u{2733}-\u{2734}\u{2744}\u{2747}\u{274C}\u{274E}\u{2753}-\u{2755}\u{2757}\u{2764}\u{2795}-\u{2797}\u{27A1}\u{27B0}\u{2934}-\u{2935}\u{2B05}-\u{2B07}\u{2B1B}-\u{2B1C}\u{2B50}\u{2B55}\u{3030}\u{303D}\u{3297}\u{3299}\u{1F004}\u{1F0CF}\u{1F170}-\u{1F171}\u{1F17E}-\u{1F17F}\u{1F18E}\u{1F191}-\u{1F19A}\u{1F1E7}-\u{1F1EC}\u{1F1EE}-\u{1F1F0}\u{1F1F3}\u{1F1F5}\u{1F1F7}-\u{1F1FA}\u{1F201}-\u{1F202}\u{1F21A}\u{1F22F}\u{1F232}-\u{1F23A}\u{1F250}-\u{1F251}\u{1F300}-\u{1F320}\u{1F330}-\u{1F335}\u{1F337}-\u{1F37C}\u{1F380}-\u{1F393}\u{1F3A0}-\u{1F3C4}\u{1F3C6}-\u{1F3CA}\u{1F3E0}-\u{1F3F0}\u{1F400}-\u{1F43E}\u{1F440}\u{1F442}-\u{1F4F7}\u{1F4F9}-\u{1F4FC}\u{1F500}-\u{1F507}\u{1F509}-\u{1F53D}\u{1F550}-\u{1F567}\u{1F5FB}-\u{1F640}\u{1F645}-\u{1F64F}\u{1F680}-\u{1F68A}]
Examples can be found here: https://stackoverflow.com/a/29115920/1911674
EDIT: I updated the regex to exclude ASCII numbers and symbols. See the comments on How do I remove emoji from string for details.
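The `\u{...}` escapes above are not accepted by every regex engine. As a rough sketch, assuming Python 3, the character class can be rewritten into Python's `\UXXXXXXXX` form before compiling. Only a shortened excerpt of the class is used here, and `to_python_pattern` and `strip_emoji` are hypothetical helper names.

```python
import re

# Shortened excerpt of the character class above (illustrative subset only).
BRACE_STYLE_CLASS = (
    r"[\u{203C}\u{2049}\u{2194}-\u{2199}\u{2600}-\u{2601}"
    r"\u{1F300}-\u{1F320}\u{1F400}-\u{1F43E}\u{1F5FB}-\u{1F640}]"
)


def to_python_pattern(char_class):
    """Rewrite \\u{XXXX} escapes as \\UXXXXXXXX escapes, which Python's re accepts."""
    return re.sub(
        r"\\u\{([0-9A-Fa-f]+)\}",
        lambda m: "\\U%08X" % int(m.group(1), 16),
        char_class,
    )


EMOJI_RE = re.compile(to_python_pattern(BRACE_STYLE_CLASS))


def strip_emoji(text):
    """Remove every character matched by the (excerpted) emoji class."""
    return EMOJI_RE.sub("", text)


print(repr(strip_emoji("cat \U0001F431!")))
```

Note that the class does not contain U+FE0F, so a variation selector following a removed emoji would stay in the string unless it is handled separately.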
Upvotes: 3
Reputation: 149484
The upcoming Unicode Emoji data files would help with this. At the moment these are still drafts, but they might still help you out.
By parsing http://www.unicode.org/Public/emoji/1.0/emoji-data.txt you could quite easily get a list of all emoji in the Unicode standard. (Note that some of these emoji consist of multiple code points.) Once you have such a list, it's trivial to turn it into a regular expression.
Here's a JavaScript version: https://github.com/mathiasbynens/emoji-regex/blob/master/index.js
And here's the script that generates it based on the data from emoji-data.txt: https://github.com/mathiasbynens/emoji-regex/blob/master/scripts/generate-regex.js
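The same idea can be sketched in Python. This is not the linked generator script, just a minimal illustration. It assumes each data line starts with one or more space-separated hex code points, followed by `;`-separated fields and a `#` comment. The inline `SAMPLE` is a hypothetical excerpt, not the real file, and range entries such as `XXXX..YYYY` are not handled.

```python
import re

# Hypothetical excerpt; the field layout is an assumption about the draft file.
SAMPLE = """\
# code ; default style ; level ; modifier status ; sources # comment
00A9 ; text ; L1 ; none ; j # COPYRIGHT SIGN
0023 20E3 ; text ; L1 ; none ; j # keycap NUMBER SIGN
1F600 ; emoji ; L1 ; secondary ; x # GRINNING FACE
1F1FA 1F1F8 ; emoji ; L1 ; none ; x # flag for United States
"""


def parse_sequences(data):
    """Return one string per data line, built from its leading hex code points."""
    sequences = []
    for line in data.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        first_field = line.split(";", 1)[0]
        sequences.append("".join(chr(int(cp, 16)) for cp in first_field.split()))
    return sequences


def build_regex(sequences):
    """Build an alternation, longest sequences first so they win over prefixes."""
    ordered = sorted(set(sequences), key=len, reverse=True)
    return re.compile("|".join(re.escape(s) for s in ordered))


emoji_re = build_regex(parse_sequences(SAMPLE))
print(emoji_re.findall("hi \U0001F600 and \U0001F1FA\U0001F1F8"))
```

Sorting the alternatives longest first matters once a sequence shares a prefix with a shorter entry, because Python's alternation takes the first branch that matches.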
Upvotes: 4