Stefan

Reputation: 114138

Regular expression matching emoji in Mac OS X / iOS

Note: this question could look odd on systems not supporting the included emoji.

This is a follow-up question to How do I remove emoji from string.

I want to build a regular expression that matches all emoji that can be entered in Mac OS X / iOS.

The obvious Unicode blocks cover most, but not all of these emoji:

Wikipedia provides a compiled list of all the symbols available in Apple Color Emoji on OS X Mountain Lion and iOS 6, which looks like a good starting point: (slightly updated)

people  = '๐Ÿ˜„๐Ÿ˜ƒ๐Ÿ˜€๐Ÿ˜Šโ˜บ๏ธ๐Ÿ˜‰๐Ÿ˜๐Ÿ˜˜๐Ÿ˜š๐Ÿ˜—๐Ÿ˜™๐Ÿ˜œ๐Ÿ˜๐Ÿ˜›๐Ÿ˜ณ๐Ÿ˜๐Ÿ˜”๐Ÿ˜Œ๐Ÿ˜’๐Ÿ˜ž๐Ÿ˜ฃ๐Ÿ˜ข๐Ÿ˜‚๐Ÿ˜ญ๐Ÿ˜ช๐Ÿ˜ฅ๐Ÿ˜ฐ๐Ÿ˜…๐Ÿ˜“๐Ÿ˜ฉ๐Ÿ˜ซ๐Ÿ˜จ๐Ÿ˜ฑ๐Ÿ˜ ๐Ÿ˜ก๐Ÿ˜ค๐Ÿ˜–๐Ÿ˜†๐Ÿ˜‹๐Ÿ˜ท๐Ÿ˜Ž๐Ÿ˜ด๐Ÿ˜ต๐Ÿ˜ฒ๐Ÿ˜Ÿ๐Ÿ˜ฆ๐Ÿ˜ง๐Ÿ˜ˆ๐Ÿ‘ฟ๐Ÿ˜ฎ๐Ÿ˜ฌ๐Ÿ˜๐Ÿ˜•๐Ÿ˜ฏ๐Ÿ˜ถ๐Ÿ˜‡๐Ÿ˜๐Ÿ˜‘๐Ÿ‘ฒ๐Ÿ‘ณ๐Ÿ‘ฎ๐Ÿ‘ท๐Ÿ’‚๐Ÿ‘ถ๐Ÿ‘ฆ๐Ÿ‘ง๐Ÿ‘จ๐Ÿ‘ฉ๐Ÿ‘ด๐Ÿ‘ต๐Ÿ‘ฑ๐Ÿ‘ผ๐Ÿ‘ธ๐Ÿ˜บ๐Ÿ˜ธ๐Ÿ˜ป๐Ÿ˜ฝ๐Ÿ˜ผ๐Ÿ™€๐Ÿ˜ฟ๐Ÿ˜น๐Ÿ˜พ๐Ÿ‘น๐Ÿ‘บ๐Ÿ™ˆ๐Ÿ™‰๐Ÿ™Š๐Ÿ’€๐Ÿ‘ฝ๐Ÿ’ฉ๐Ÿ”ฅโœจ๐ŸŒŸ๐Ÿ’ซ๐Ÿ’ฅ๐Ÿ’ข๐Ÿ’ฆ๐Ÿ’ง๐Ÿ’ค๐Ÿ’จ๐Ÿ‘‚๐Ÿ‘€๐Ÿ‘ƒ๐Ÿ‘…๐Ÿ‘„๐Ÿ‘๐Ÿ‘Ž๐Ÿ‘Œ๐Ÿ‘ŠโœŠโœŒ๐Ÿ‘‹โœ‹๐Ÿ‘๐Ÿ‘†๐Ÿ‘‡๐Ÿ‘‰๐Ÿ‘ˆ๐Ÿ™Œ๐Ÿ™โ˜๐Ÿ‘๐Ÿ’ช๐Ÿšถ๐Ÿƒ๐Ÿ’ƒ๐Ÿ‘ซ๐Ÿ‘ช๐Ÿ‘ฌ๐Ÿ‘ญ๐Ÿ’๐Ÿ’‘๐Ÿ‘ฏ๐Ÿ™†๐Ÿ™…๐Ÿ’๐Ÿ™‹๐Ÿ’†๐Ÿ’‡๐Ÿ’…๐Ÿ‘ฐ๐Ÿ™Ž๐Ÿ™๐Ÿ™‡๐ŸŽฉ๐Ÿ‘‘๐Ÿ‘’๐Ÿ‘Ÿ๐Ÿ‘ž๐Ÿ‘ก๐Ÿ‘ ๐Ÿ‘ข๐Ÿ‘•๐Ÿ‘”๐Ÿ‘š๐Ÿ‘—๐ŸŽฝ๐Ÿ‘–๐Ÿ‘˜๐Ÿ‘™๐Ÿ’ผ๐Ÿ‘œ๐Ÿ‘๐Ÿ‘›๐Ÿ‘“๐ŸŽ€๐ŸŒ‚๐Ÿ’„๐Ÿ’›๐Ÿ’™๐Ÿ’œ๐Ÿ’šโค๐Ÿ’”๐Ÿ’—๐Ÿ’“๐Ÿ’•๐Ÿ’–๐Ÿ’ž๐Ÿ’˜๐Ÿ’Œ๐Ÿ’‹๐Ÿ’๐Ÿ’Ž๐Ÿ‘ค๐Ÿ‘ฅ๐Ÿ’ฌ๐Ÿ‘ฃ๐Ÿ’ญ'
nature  = '๐Ÿถ๐Ÿบ๐Ÿฑ๐Ÿญ๐Ÿน๐Ÿฐ๐Ÿธ๐Ÿฏ๐Ÿจ๐Ÿป๐Ÿท๐Ÿฝ๐Ÿฎ๐Ÿ—๐Ÿต๐Ÿ’๐Ÿด๐Ÿ‘๐Ÿ˜๐Ÿผ๐Ÿง๐Ÿฆ๐Ÿค๐Ÿฅ๐Ÿฃ๐Ÿ”๐Ÿ๐Ÿข๐Ÿ›๐Ÿ๐Ÿœ๐Ÿž๐ŸŒ๐Ÿ™๐Ÿš๐Ÿ ๐ŸŸ๐Ÿฌ๐Ÿณ๐Ÿ‹๐Ÿ„๐Ÿ๐Ÿ€๐Ÿƒ๐Ÿ…๐Ÿ‡๐Ÿ‰๐ŸŽ๐Ÿ๐Ÿ“๐Ÿ•๐Ÿ–๐Ÿ๐Ÿ‚๐Ÿฒ๐Ÿก๐ŸŠ๐Ÿซ๐Ÿช๐Ÿ†๐Ÿˆ๐Ÿฉ๐Ÿพ๐Ÿ’๐ŸŒธ๐ŸŒท๐Ÿ€๐ŸŒน๐ŸŒป๐ŸŒบ๐Ÿ๐Ÿƒ๐Ÿ‚๐ŸŒฟ๐ŸŒพ๐Ÿ„๐ŸŒต๐ŸŒด๐ŸŒฒ๐ŸŒณ๐ŸŒฐ๐ŸŒฑ๐ŸŒผ๐ŸŒ๐ŸŒž๐ŸŒ๐ŸŒš๐ŸŒ‘๐ŸŒ’๐ŸŒ“๐ŸŒ”๐ŸŒ•๐ŸŒ–๐ŸŒ—๐ŸŒ˜๐ŸŒœ๐ŸŒ›๐ŸŒ™๐ŸŒ๐ŸŒŽ๐ŸŒ๐ŸŒ‹๐ŸŒŒ๐ŸŒ โญโ˜€โ›…โ˜โšกโ˜”โ„โ›„๐ŸŒ€๐ŸŒ๐ŸŒˆ๐ŸŒŠ'
objects = '๐ŸŽ๐Ÿ’๐ŸŽŽ๐ŸŽ’๐ŸŽ“๐ŸŽ๐ŸŽ†๐ŸŽ‡๐ŸŽ๐ŸŽ‘๐ŸŽƒ๐Ÿ‘ป๐ŸŽ…๐ŸŽ„๐ŸŽ๐ŸŽ‹๐ŸŽ‰๐ŸŽŠ๐ŸŽˆ๐ŸŽŒ๐Ÿ”ฎ๐ŸŽฅ๐Ÿ“ท๐Ÿ“น๐Ÿ“ผ๐Ÿ’ฟ๐Ÿ“€๐Ÿ’ฝ๐Ÿ’พ๐Ÿ’ป๐Ÿ“ฑโ˜Ž๐Ÿ“ž๐Ÿ“Ÿ๐Ÿ“ ๐Ÿ“ก๐Ÿ“บ๐Ÿ“ป๐Ÿ”Š๐Ÿ”‰๐Ÿ”ˆ๐Ÿ”‡๐Ÿ””๐Ÿ”•๐Ÿ“ข๐Ÿ“ฃโณโŒ›โฐโŒš๐Ÿ”“๐Ÿ”’๐Ÿ”๐Ÿ”๐Ÿ”‘๐Ÿ”Ž๐Ÿ’ก๐Ÿ”ฆ๐Ÿ”†๐Ÿ”…๐Ÿ”Œ๐Ÿ”‹๐Ÿ”๐Ÿ›๐Ÿ›€๐Ÿšฟ๐Ÿšฝ๐Ÿ”ง๐Ÿ”ฉ๐Ÿ”จ๐Ÿšช๐Ÿšฌ๐Ÿ’ฃ๐Ÿ”ซ๐Ÿ”ช๐Ÿ’Š๐Ÿ’‰๐Ÿ’ฐ๐Ÿ’ด๐Ÿ’ต๐Ÿ’ท๐Ÿ’ถ๐Ÿ’ณ๐Ÿ’ธ๐Ÿ“ฒ๐Ÿ“ง๐Ÿ“ฅ๐Ÿ“คโœ‰๐Ÿ“ฉ๐Ÿ“จ๐Ÿ“ฏ๐Ÿ“ซ๐Ÿ“ช๐Ÿ“ฌ๐Ÿ“ญ๐Ÿ“ฎ๐Ÿ“ฆ๐Ÿ“๐Ÿ“„๐Ÿ“ƒ๐Ÿ“‘๐Ÿ“Š๐Ÿ“ˆ๐Ÿ“‰๐Ÿ“œ๐Ÿ“‹๐Ÿ“…๐Ÿ“†๐Ÿ“‡๐Ÿ“๐Ÿ“‚โœ‚๐Ÿ“Œ๐Ÿ“Žโœ’โœ๐Ÿ“๐Ÿ“๐Ÿ“•๐Ÿ“—๐Ÿ“˜๐Ÿ“™๐Ÿ““๐Ÿ“”๐Ÿ“’๐Ÿ“š๐Ÿ“–๐Ÿ”–๐Ÿ“›๐Ÿ”ฌ๐Ÿ”ญ๐Ÿ“ฐ๐ŸŽจ๐ŸŽฌ๐ŸŽค๐ŸŽง๐ŸŽผ๐ŸŽต๐ŸŽถ๐ŸŽน๐ŸŽป๐ŸŽบ๐ŸŽท๐ŸŽธ๐Ÿ‘พ๐ŸŽฎ๐Ÿƒ๐ŸŽด๐Ÿ€„๐ŸŽฒ๐ŸŽฏ๐Ÿˆ๐Ÿ€โšฝโšพ๐ŸŽพ๐ŸŽฑ๐Ÿ‰๐ŸŽณโ›ณ๐Ÿšต๐Ÿšด๐Ÿ๐Ÿ‡๐Ÿ†๐ŸŽฟ๐Ÿ‚๐ŸŠ๐Ÿ„๐ŸŽฃโ˜•๐Ÿต๐Ÿถ๐Ÿผ๐Ÿบ๐Ÿป๐Ÿธ๐Ÿน๐Ÿท๐Ÿด๐Ÿ•๐Ÿ”๐ŸŸ๐Ÿ—๐Ÿ–๐Ÿ๐Ÿ›๐Ÿค๐Ÿฑ๐Ÿฃ๐Ÿฅ๐Ÿ™๐Ÿ˜๐Ÿš๐Ÿœ๐Ÿฒ๐Ÿข๐Ÿก๐Ÿณ๐Ÿž๐Ÿฉ๐Ÿฎ๐Ÿฆ๐Ÿจ๐Ÿง๐ŸŽ‚๐Ÿฐ๐Ÿช๐Ÿซ๐Ÿฌ๐Ÿญ๐Ÿฏ๐ŸŽ๐Ÿ๐ŸŠ๐Ÿ‹๐Ÿ’๐Ÿ‡๐Ÿ‰๐Ÿ“๐Ÿ‘๐Ÿˆ๐ŸŒ๐Ÿ๐Ÿ๐Ÿ ๐Ÿ†๐Ÿ…๐ŸŒฝ'
places  = '๐Ÿ ๐Ÿก๐Ÿซ๐Ÿข๐Ÿฃ๐Ÿฅ๐Ÿฆ๐Ÿช๐Ÿฉ๐Ÿจ๐Ÿ’’โ›ช๐Ÿฌ๐Ÿค๐ŸŒ‡๐ŸŒ†๐Ÿฏ๐Ÿฐโ›บ๐Ÿญ๐Ÿ—ผ๐Ÿ—พ๐Ÿ—ป๐ŸŒ„๐ŸŒ…๐ŸŒƒ๐Ÿ—ฝ๐ŸŒ‰๐ŸŽ ๐ŸŽกโ›ฒ๐ŸŽข๐Ÿšขโ›ต๐Ÿšค๐Ÿšฃโš“๐Ÿš€โœˆ๐Ÿ’บ๐Ÿš๐Ÿš‚๐ŸšŠ๐Ÿš‰๐Ÿšž๐Ÿš†๐Ÿš„๐Ÿš…๐Ÿšˆ๐Ÿš‡๐Ÿš๐Ÿš‹๐Ÿšƒ๐ŸšŽ๐ŸšŒ๐Ÿš๐Ÿš™๐Ÿš˜๐Ÿš—๐Ÿš•๐Ÿš–๐Ÿš›๐Ÿšš๐Ÿšจ๐Ÿš“๐Ÿš”๐Ÿš’๐Ÿš‘๐Ÿš๐Ÿšฒ๐Ÿšก๐ŸšŸ๐Ÿš ๐Ÿšœ๐Ÿ’ˆ๐Ÿš๐ŸŽซ๐Ÿšฆ๐Ÿšฅโš ๐Ÿšง๐Ÿ”ฐโ›ฝ๐Ÿฎ๐ŸŽฐโ™จ๐Ÿ—ฟ๐ŸŽช๐ŸŽญ๐Ÿ“๐Ÿšฉ๐Ÿ‡ฏ๐Ÿ‡ต๐Ÿ‡ฐ๐Ÿ‡ท๐Ÿ‡ฉ๐Ÿ‡ช๐Ÿ‡จ๐Ÿ‡ณ๐Ÿ‡บ๐Ÿ‡ธ๐Ÿ‡ซ๐Ÿ‡ท๐Ÿ‡ช๐Ÿ‡ธ๐Ÿ‡ฎ๐Ÿ‡น๐Ÿ‡ท๐Ÿ‡บ๐Ÿ‡ฌ๐Ÿ‡ง'
symbols = '1๏ธโƒฃ2๏ธโƒฃ3๏ธโƒฃ4๏ธโƒฃ5๏ธโƒฃ6๏ธโƒฃ7๏ธโƒฃ8๏ธโƒฃ9๏ธโƒฃ0๏ธโƒฃ๐Ÿ”Ÿ๐Ÿ”ข#๏ธโƒฃ๐Ÿ”ฃโฌ†๏ธโฌ‡๏ธโฌ…๏ธโžก๏ธ๐Ÿ” ๐Ÿ”ก๐Ÿ”คโ†—๏ธโ†–๏ธโ†˜๏ธโ†™๏ธโ†”๏ธโ†•๏ธ๐Ÿ”„โ—€๏ธโ–ถ๏ธ๐Ÿ”ผ๐Ÿ”ฝโ†ฉ๏ธโ†ช๏ธโ„น๏ธโชโฉโซโฌโคต๏ธโคด๏ธ๐Ÿ†—๐Ÿ”€๐Ÿ”๐Ÿ”‚๐Ÿ†•๐Ÿ†™๐Ÿ†’๐Ÿ†“๐Ÿ†–๐Ÿ“ถ๐ŸŽฆ๐Ÿˆ๐Ÿˆฏ๐Ÿˆณ๐Ÿˆต๐Ÿˆด๐Ÿˆฒ๐Ÿ‰๐Ÿˆน๐Ÿˆบ๐Ÿˆถ๐Ÿˆš๐Ÿšป๐Ÿšน๐Ÿšบ๐Ÿšผ๐Ÿšพ๐Ÿšฐ๐Ÿšฎ๐Ÿ…ฟ๏ธโ™ฟ๏ธ๐Ÿšญ๐Ÿˆท๐Ÿˆธ๐Ÿˆ‚โ“‚๏ธ๐Ÿ›‚๐Ÿ›„๐Ÿ›…๐Ÿ›ƒ๐Ÿ‰‘ใŠ™๏ธใŠ—๏ธ๐Ÿ†‘๐Ÿ†˜๐Ÿ†”๐Ÿšซ๐Ÿ”ž๐Ÿ“ต๐Ÿšฏ๐Ÿšฑ๐Ÿšณ๐Ÿšท๐Ÿšธโ›”โœณ๏ธโ‡๏ธโŽโœ…โœด๏ธ๐Ÿ’Ÿ๐Ÿ†š๐Ÿ“ณ๐Ÿ“ด๐Ÿ…ฐ๐Ÿ…ฑ๐Ÿ†Ž๐Ÿ…พ๐Ÿ’ โžฟโ™ป๏ธโ™ˆ๏ธโ™‰๏ธโ™Š๏ธโ™‹๏ธโ™Œ๏ธโ™๏ธโ™Ž๏ธโ™๏ธโ™๏ธโ™‘๏ธโ™’๏ธโ™“๏ธโ›Ž๐Ÿ”ฏ๐Ÿง๐Ÿ’น๐Ÿ’ฒ๐Ÿ’ฑยฉ๏ธยฎ๏ธโ„ข๏ธโŒโ€ผ๏ธโ‰๏ธโ—โ“โ•โ”โญ•๐Ÿ”๐Ÿ”š๐Ÿ”™๐Ÿ”›๐Ÿ”œ๐Ÿ”ƒ๐Ÿ•›๐Ÿ•ง๐Ÿ•๐Ÿ•œ๐Ÿ•‘๐Ÿ•๐Ÿ•’๐Ÿ•ž๐Ÿ•“๐Ÿ•Ÿ๐Ÿ•”๐Ÿ• ๐Ÿ••๐Ÿ•–๐Ÿ•—๐Ÿ•˜๐Ÿ•™๐Ÿ•š๐Ÿ•ก๐Ÿ•ข๐Ÿ•ฃ๐Ÿ•ค๐Ÿ•ฅ๐Ÿ•ฆโœ–๏ธโž•โž–โž—โ™ โ™ฅโ™ฃโ™ฆ๐Ÿ’ฎ๐Ÿ’ฏโœ”โ˜‘๐Ÿ”˜๐Ÿ”—โžฐใ€ฐใ€ฝ๏ธ๐Ÿ”ฑโ—ผ๏ธโ—ป๏ธโ—พ๏ธโ—ฝ๏ธโ–ช๏ธโ–ซ๏ธ๐Ÿ”บ๐Ÿ”ฒ๐Ÿ”ณโšซ๏ธโšช๏ธ๐Ÿ”ด๐Ÿ”ต๐Ÿ”ปโฌœ๏ธโฌ›๏ธ๐Ÿ”ถ๐Ÿ”ท๐Ÿ”ธ๐Ÿ”น'

emoji = people + nature + objects + places + symbols # all emoji combined

Most characters have a single code point and converting these would be easy:

But some characters are "encoded using two Unicode values":

And some even have three code points:

(Variation Selector 16 means "emoji style")
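To make the three cases concrete, here is a minimal Python sketch that prints the code points of one emoji of each kind. The three samples are my own picks from the list above (a smiley, the Japanese flag and the keycap digit one), written as escapes so they survive systems without emoji support; they are not necessarily the examples originally shown here.

```python
import unicodedata

# Samples chosen for illustration (an assumption, not the original examples).
samples = {
    "one code point": "\U0001F604",               # smiley
    "two code points": "\U0001F1EF\U0001F1F5",    # regional indicators J + P
    "three code points": "1\uFE0F\u20E3",         # digit + VS16 + keycap
}


def code_points(text):
    """Return the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


for label, text in samples.items():
    # unicodedata.name() falls back to the given default for unnamed characters.
    names = [unicodedata.name(ch, "<unnamed>") for ch in text]
    print(label, code_points(text), names)
```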

How can I split this list into characters (without splitting combined characters), find their code point(s) and finally build a regular expression matching them?

The regex doesn't have to respect "missing" characters within larger blocks, i.e. it's okay if the 4 Unicode blocks mentioned above are entirely covered.
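One possible approach, sketched in Python under simplifying assumptions: treat Variation Selector 16 and the combining keycap (U+20E3) as attaching to the preceding character, and pair up consecutive regional indicator symbols as flags. The helper names (`split_emoji`, `to_ranges`, `build_regex`) are made up for this sketch, and it does not try to implement full Unicode grapheme segmentation.

```python
import re

VS16 = 0xFE0F    # VARIATION SELECTOR-16 ("emoji style")
KEYCAP = 0x20E3  # COMBINING ENCLOSING KEYCAP


def is_regional_indicator(cp):
    return 0x1F1E6 <= cp <= 0x1F1FF


def split_emoji(text):
    """Split text into emoji, keeping combined sequences together."""
    clusters = []
    for ch in text:
        cp = ord(ch)
        prev = clusters[-1] if clusters else ""
        if prev and cp in (VS16, KEYCAP):
            clusters[-1] += ch  # attach to the preceding character
        elif (is_regional_indicator(cp) and len(prev) == 1
              and is_regional_indicator(ord(prev))):
            clusters[-1] += ch  # second half of a flag
        else:
            clusters.append(ch)
    return clusters


def to_ranges(code_points):
    """Collapse code points into sorted [low, high] ranges."""
    ranges = []
    for cp in sorted(set(code_points)):
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1][1] = cp
        else:
            ranges.append([cp, cp])
    return ranges


def build_regex(clusters):
    """Build a regex: multi-code-point sequences first, then one character class."""
    singles = [ord(c) for c in clusters if len(c) == 1]
    sequences = sorted({c for c in clusters if len(c) > 1},
                       key=lambda s: (-len(s), s))
    alternatives = [re.escape(s) for s in sequences]
    parts = []
    for lo, hi in to_ranges(singles):
        parts.append(f"\\U{lo:08X}" if lo == hi else f"\\U{lo:08X}-\\U{hi:08X}")
    if parts:
        alternatives.append("[" + "".join(parts) + "]")
    return re.compile("|".join(alternatives))


# Tiny sample instead of the full list: two smileys, a keycap, two flags, a heart.
sample = ("\U0001F604\U0001F603" "1\uFE0F\u20E3"
          "\U0001F1EF\U0001F1F5\U0001F1F0\U0001F1F7" "\u2764")
pattern = build_regex(split_emoji(sample))
print(pattern.pattern)
```

Putting the longer sequences before the character class means a keycap is matched as a whole, while a plain digit on its own is left alone. With the real `emoji` string in place of `sample`, the same two calls should apply, though I have only tried the idea on small inputs like the one above.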

(I'm going to answer this myself if I don't get any answers, but maybe there's an easy solution)

Upvotes: 9

Views: 5915

Answers (2)

franklsf95

Reputation: 1252

This regex matches all 845 emoji, taken from Emoji unicode characters for use on the web:

[\u{203C}\u{2049}\u{20E3}\u{2122}\u{2139}\u{2194}-\u{2199}\u{21A9}-\u{21AA}\u{231A}-\u{231B}\u{23E9}-\u{23EC}\u{23F0}\u{23F3}\u{24C2}\u{25AA}-\u{25AB}\u{25B6}\u{25C0}\u{25FB}-\u{25FE}\u{2600}-\u{2601}\u{260E}\u{2611}\u{2614}-\u{2615}\u{261D}\u{263A}\u{2648}-\u{2653}\u{2660}\u{2663}\u{2665}-\u{2666}\u{2668}\u{267B}\u{267F}\u{2693}\u{26A0}-\u{26A1}\u{26AA}-\u{26AB}\u{26BD}-\u{26BE}\u{26C4}-\u{26C5}\u{26CE}\u{26D4}\u{26EA}\u{26F2}-\u{26F3}\u{26F5}\u{26FA}\u{26FD}\u{2702}\u{2705}\u{2708}-\u{270C}\u{270F}\u{2712}\u{2714}\u{2716}\u{2728}\u{2733}-\u{2734}\u{2744}\u{2747}\u{274C}\u{274E}\u{2753}-\u{2755}\u{2757}\u{2764}\u{2795}-\u{2797}\u{27A1}\u{27B0}\u{2934}-\u{2935}\u{2B05}-\u{2B07}\u{2B1B}-\u{2B1C}\u{2B50}\u{2B55}\u{3030}\u{303D}\u{3297}\u{3299}\u{1F004}\u{1F0CF}\u{1F170}-\u{1F171}\u{1F17E}-\u{1F17F}\u{1F18E}\u{1F191}-\u{1F19A}\u{1F1E7}-\u{1F1EC}\u{1F1EE}-\u{1F1F0}\u{1F1F3}\u{1F1F5}\u{1F1F7}-\u{1F1FA}\u{1F201}-\u{1F202}\u{1F21A}\u{1F22F}\u{1F232}-\u{1F23A}\u{1F250}-\u{1F251}\u{1F300}-\u{1F320}\u{1F330}-\u{1F335}\u{1F337}-\u{1F37C}\u{1F380}-\u{1F393}\u{1F3A0}-\u{1F3C4}\u{1F3C6}-\u{1F3CA}\u{1F3E0}-\u{1F3F0}\u{1F400}-\u{1F43E}\u{1F440}\u{1F442}-\u{1F4F7}\u{1F4F9}-\u{1F4FC}\u{1F500}-\u{1F507}\u{1F509}-\u{1F53D}\u{1F550}-\u{1F567}\u{1F5FB}-\u{1F640}\u{1F645}-\u{1F64F}\u{1F680}-\u{1F68A}]

Examples can be found here: https://stackoverflow.com/a/29115920/1911674
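The `\u{...}` escapes above are not understood by every regex engine. As a rough sketch for Python, whose `re` module accepts `\UXXXXXXXX` instead, the braces can be rewritten mechanically. The helper names `from_braced_escapes` and `strip_emoji` are hypothetical, and the pattern below is only a short excerpt of the full character class, used for illustration.

```python
import re


def from_braced_escapes(pattern):
    # Rewrite each \u{XXXX} escape as an eight-digit \UXXXXXXXX escape.
    return re.sub(
        r"\\u\{([0-9A-Fa-f]{1,6})\}",
        lambda m: f"\\U{int(m.group(1), 16):08X}",
        pattern,
    )


# Excerpt of the character class above; paste the full class here in practice.
excerpt = r"[\u{203C}\u{2049}\u{2194}-\u{2199}\u{1F300}-\u{1F320}\u{1F5FB}-\u{1F640}]"
emoji_re = re.compile(from_braced_escapes(excerpt))


def strip_emoji(text):
    """Remove every character matched by the (excerpted) emoji class."""
    return emoji_re.sub("", text)


print(repr(strip_emoji("ok \U0001F604 \u203C")))
```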

EDIT: I updated the regex to exclude ASCII numbers and symbols. See comments from How do I remove emoji from string for details.

Upvotes: 3

Mathias Bynens

Reputation: 149484

The upcoming Unicode Emoji data files would help with this. At the moment these are still drafts, but they might still help you out.

By parsing http://www.unicode.org/Public/emoji/1.0/emoji-data.txt you could quite easily get a list of all emoji in the Unicode standard. (Note that some of these emoji consist of multiple code points.) Once you have such a list, it's trivial to turn it into a regular expression.
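As a rough illustration of that idea in Python: the sketch below assumes each data line starts with a semicolon-terminated field holding either one or more hex code points or a `XXXX..YYYY` range, with `#` starting a comment, as in other Unicode data files. I have not checked this against the draft file, and the `SAMPLE` lines are invented for the example rather than copied from it. The function names are hypothetical too.

```python
import re

# Invented sample lines in the ASSUMED layout; not taken from the real file.
SAMPLE = """\
# illustrative lines only
00A9 ; text ; L1 ; none ; j # COPYRIGHT SIGN
0023 20E3 ; text ; L1 ; none ; j # keycap NUMBER SIGN
1F600..1F602 ; emoji ; L1 ; none ; j # faces
"""


def parse_emoji_data(text):
    """Return the emoji (as strings) listed in the first field of each line."""
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        field = line.split(";", 1)[0].strip()
        if ".." in field:
            start, end = (int(part, 16) for part in field.split(".."))
            entries.extend(chr(cp) for cp in range(start, end + 1))
        else:
            # One or more space-separated code points forming a single emoji.
            entries.append("".join(chr(int(cp, 16)) for cp in field.split()))
    return entries


def build_regex(entries):
    """Alternation of all entries, longest first so sequences win over prefixes."""
    ordered = sorted(set(entries), key=lambda e: (-len(e), e))
    return re.compile("|".join(re.escape(e) for e in ordered))


emoji_re = build_regex(parse_emoji_data(SAMPLE))
print(emoji_re.findall("x\u00A9 #\u20E3 #"))
```

A plain alternation is the simplest thing that works; collapsing the single code points into character class ranges would give a much shorter pattern.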

Here's a JavaScript version: https://github.com/mathiasbynens/emoji-regex/blob/master/index.js And here's the script that generates it based on the data from emoji-data.txt: https://github.com/mathiasbynens/emoji-regex/blob/master/scripts/generate-regex.js

Upvotes: 4
