Reputation: 444
I'm trying to split a string into emoji and non-emoji chunks. I got a regex from here and altered it to take the text variation selector (U+FE0E) into account:
(?:(?!(\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])+\ufe0e))(\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])+
This works with .match, such as:
'🇦🇨'.match(regex) // (["0x1F1E6", "0x1F1E8"]) => ['🇦🇨']
'🇦🇨\uFE0E'.match(regex) // (["0x1F1E6", "0x1F1E8", "0xFE0E"]) => null
But split isn't giving me the expected results:
'🇦🇨'.split(regex) // (["", undefined, "🇨", ""]) => ['🇦🇨']
I need split to return the entire emoji in one element. What am I doing wrong?
EDIT:
I have a working regex now, except for the edge case exhibited here: https://regex101.com/r/Vki2ZS/2.
I don't want the second emoji to be matched, since it is followed by the text variation selector. I think this happens because I'm using a lookahead, as the reversed string is matched as expected, but I can't use a negative lookbehind since it's not supported by all browsers.
Upvotes: 2
Views: 310
Reputation: 627468
Your pattern does not work because the second emoji is partly matched by the + quantified group (?:\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])+
In \uD83E\uDD20\uFE0F\uD83E\uDD20\uFE0E, the substring \uD83E\uDD20\uFE0F\uD83E\uDD20 was matched in two iterations: first \uD83E\uDD20\uFE0F, then \uD83E\uDD20.
The pattern you may use with .split is
/((?:(?:\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])+(?!\ufe0e)(?:\ufe0f)?(?:\u200d)?)+)/
The main goal was to fail all matches where (?:\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])+ is followed by \uFE0E, which is why I added the negative lookahead (?!\ufe0e).
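The pattern above also wraps the whole repeated match in a single capturing group, which matters for .split. As a minimal sketch of that behavior (the `unit` sub-pattern is a simplified, hypothetical stand-in for the full alternation, and the variable names are only for illustration): .split inserts captured groups into the result, and a quantified capturing group keeps only its last iteration.

```javascript
// Hypothetical, simplified emoji unit: one surrogate pair starting with \ud83c.
// (Assumption: the full alternation from the answer is omitted for brevity.)
const unit = '\\ud83c[\\udc00-\\udfff]';

// Regional indicators A + C, written as escapes so the code points are visible.
const flag = '\u{1F1E6}\u{1F1E8}';
const input = 'x' + flag + 'y';

// Quantified capturing group: the capture holds only the LAST iteration,
// so split() returns just the second regional indicator.
const repeatedCapture = new RegExp('(' + unit + ')+');
const withRepeated = input.split(repeatedCapture);
// -> ['x', '\u{1F1E8}', 'y']

// One capturing group around the whole quantified run keeps the full flag.
const wrappedCapture = new RegExp('((?:' + unit + ')+)');
const withWrapped = input.split(wrappedCapture);
// -> ['x', '\u{1F1E6}\u{1F1E8}', 'y']

// No capturing group at all: the matched text is dropped from the result.
const noCapture = new RegExp('(?:' + unit + ')+');
const withNone = input.split(noCapture);
// -> ['x', 'y']

console.log(withRepeated, withWrapped, withNone);
```

This is consistent with the split output in the question, where only the trailing regional indicator survived.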
JS demo:
var regex = /((?:(?:\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])+(?!\ufe0e)(?:\ufe0f)?(?:\u200d)?)+)/;
console.log('🇦🇨'.split(regex));
console.log('🤠️🤠︎'.split(regex));
// If you need to wrap the match with some tags:
console.log('🤠️🤠︎'.replace(/(?:(?:\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])+(?!\ufe0e)(?:\ufe0f)?(?:\u200d)?)+/g, '<span class="special">$&</span>'))
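Since .split leaves empty strings at the edges when the input starts or ends with an emoji, a small wrapper may be convenient. The following is only a sketch: `splitEmojiChunks` and `emojiRun` are hypothetical names, and the regex is copied unchanged from the demo above.

```javascript
// Regex copied from the demo above; the outer capturing group makes split()
// keep the emoji runs in the result.
const emojiRun = /((?:(?:\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])+(?!\ufe0e)(?:\ufe0f)?(?:\u200d)?)+)/;

// Hypothetical helper: returns the text and emoji chunks in order, without
// the empty strings split() produces at the start or end of the input.
function splitEmojiChunks(text) {
  return text.split(emojiRun).filter(chunk => chunk !== '');
}

// Flag (regional indicators A + C) surrounded by plain text.
const mixed = splitEmojiChunks('Hi \u{1F1E6}\u{1F1E8}!');
// -> ['Hi ', '\u{1F1E6}\u{1F1E8}', '!']

// Emoji presentation (U+FE0F) followed by text presentation (U+FE0E):
// only the first cowboy is treated as an emoji chunk.
const variants = splitEmojiChunks('\u{1F920}\uFE0F\u{1F920}\uFE0E');
// -> ['\u{1F920}\uFE0F', '\u{1F920}\uFE0E']

console.log(mixed, variants);
```

Note that an emoji followed by the text variation selector ends up in a non-emoji chunk, which is the behavior asked for in the question.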
Upvotes: 1