ragurney

Reputation: 444

Splitting a String with Emoji Regex Respecting Variation Selector 15

I'm trying to split a string into emoji and non-emoji chunks. I took a regex from here and altered it as follows to take the text variation selector (U+FE0E, Variation Selector 15) into account:

(?:(?!(\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])+\ufe0e))(\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])+

This works with .match such as:

'🇦🇨'.match(regex) // (["0x1F1E6", "0x1F1E8"]) => ['🇦🇨']
'🇦🇨\uFE0E'.match(regex) // (["0x1F1E6", "0x1F1E8", "0xFE0E"]) => null

But split isn't giving me the expected results:

'🇦🇨'.split(regex) // actual: ["", undefined, "🇨", ""], expected: ['🇦🇨']

I need split to return the entire emoji in one element. What am I doing wrong?

EDIT:

I have a working regex now, except for the edge case exhibited here: https://regex101.com/r/Vki2ZS/2.

I don't want the second emoji to be matched, since it is followed by the text variation selector. I think this happens because I'm using a lookahead, as the reversed string is matched as expected, but I can't use a negative lookbehind since it's not supported by all browsers.

Upvotes: 2

Views: 310

Answers (1)

Wiktor Stribiżew

Reputation: 627468

Your pattern does not work because the +-quantified group (?:\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])+ matched part of the second emoji: in \uD83E\uDD20\uFE0F\uD83E\uDD20\uFE0E, it matched \uD83E\uDD20\uFE0F\uD83E\uDD20 in two iterations, first \uD83E\uDD20\uFE0F and then \uD83E\uDD20.

The pattern you may use with .split is

/((?:(?:\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])+(?!\ufe0e)(?:\ufe0f)?(?:\u200d)?)+)/

The main goal is to fail every match where (?:\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])+ is followed by \uFE0E, so I added a negative lookahead, (?!\ufe0e), right after it.
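A side note on the grouping: the pattern uses one outer capturing group and only non-capturing groups inside because split inserts the value of every capturing group into its result. That is most likely where the undefined and the lone "🇨" in your original output came from. Here is a minimal sketch with simplified, hypothetical patterns, where the letters "a" and "b" stand in for emoji:

```javascript
// Simplified stand-ins (an assumption for illustration): "a" and "b" play the emoji.
const text = 'xaby';

// No capturing group: the separators are dropped from the result.
const dropped = text.split(/(?:a|b)+/);

// One outer capturing group: the whole separator is kept as a single element.
const kept = text.split(/((?:a|b)+)/);

// A quantified capturing group: only its last iteration is reported.
const partial = text.split(/(a|b)+/);

// A capturing group inside a negative lookahead never participates in the match,
// so split reports it as undefined.
const withUndefined = text.split(/(?!(z))(?:a|b)+/);

console.log(dropped);       // ["x", "y"]
console.log(kept);          // ["x", "ab", "y"]
console.log(partial);       // ["x", "b", "y"]
console.log(withUndefined); // ["x", undefined, "y"]
```
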

JS demo:

var regex = /((?:(?:\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])+(?!\ufe0e)(?:\ufe0f)?(?:\u200d)?)+)/;
console.log('🇦🇨'.split(regex));
console.log('🤠️🤠︎'.split(regex));

// If you need to wrap the match with some tags:
console.log('🤠️🤠︎'.replace(/(?:(?:\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])+(?!\ufe0e)(?:\ufe0f)?(?:\u200d)?)+/g, '<span class="special">$&</span>'))
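If you also need to know which chunks are emoji, one possible way is to post-process the split result. The sketch below reuses the pattern from above. The names EMOJI_RUN and splitEmojiChunks are hypothetical and only serve the illustration:

```javascript
// Pattern copied from the answer above: the single outer capturing group makes
// split() keep each emoji run as its own element.
const EMOJI_RUN = /((?:(?:\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])+(?!\ufe0e)(?:\ufe0f)?(?:\u200d)?)+)/;

// Hypothetical helper: returns the chunks in order, each flagged as emoji or text.
function splitEmojiChunks(text) {
  return text
    .split(EMOJI_RUN)
    // With exactly one capturing group, odd indexes hold the matched emoji runs.
    .map((value, index) => ({ value, emoji: index % 2 === 1 }))
    // Drop the empty strings split() produces when a match touches a string edge.
    .filter((chunk) => chunk.value !== '');
}

// Emoji followed by U+FE0F (emoji presentation) becomes its own chunk.
console.log(splitEmojiChunks('hi \ud83e\udd20\ufe0f there'));

// Emoji followed by U+FE0E (text presentation) stays a plain text chunk.
console.log(splitEmojiChunks('\ud83e\udd20\ufe0e'));
```

The index check relies on the pattern having exactly one capturing group. If you add another one, the positions of the emoji runs in the split result change.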

Upvotes: 1
