Reputation: 1219
I need a JavaScript regex that matches words in any language, but fails for emoji or any other character. The solution in "Regular expression to match non-English characters?" matches all letters plus pictograms and emoji ([^\u0000-\u007F]+).
Modifying it a bit seems to accomplish what I need, but I'm not sure how safe it is: ([a-zA-Z]|[^\u0000-\u007F\u200d-\u3299\ud83c-\udfff\ufe0e\ufe0f])+
Example:
America🇺🇸
Österreich🇦🇹
Россия🇷🇺
Ελλάδα🇬🇷
The regex should match only the letters and stop before the emoji. It should also not match emoji that are built on plain text characters, for example: 1️⃣#️⃣*️⃣
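One way to gauge how safe the modified pattern is would be to run it against the samples. The sketch below does that; the helper name firstMatch and the Japanese sample are illustrative additions, not part of the parser. It suggests the pattern behaves as wanted on the listed examples, but the excluded range \u200d-\u3299 also covers some real scripts, Hiragana among them.

```javascript
// The modified pattern from the question. It has no `u` flag, so an astral
// emoji is seen as two surrogate code units, both inside \ud83c-\udfff.
const lettersOnly = /([a-zA-Z]|[^\u0000-\u007F\u200d-\u3299\ud83c-\udfff\ufe0e\ufe0f])+/;

// Hypothetical helper: returns the first match, or null when nothing matches.
function firstMatch(text) {
  const m = lettersOnly.exec(text);
  return m ? m[0] : null;
}

console.log(firstMatch('America🇺🇸'));     // "America"
console.log(firstMatch('Россия🇷🇺'));      // "Россия"
console.log(firstMatch('1\uFE0F\u20E3')); // null (keycap sequence 1️⃣)
console.log(firstMatch('ひらがな'));        // null (Hiragana sits inside \u200d-\u3299)
```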
Relevant: http://www.unicode.org/Public/emoji/5.0/emoji-variation-sequences.txt
Bit of context:
I'm trying to patch this parser: https://github.com/Khan/simple-markdown/blob/master/simple-markdown.js#L1304 to break on emoji, because currently it matches as much text as it can. Without that, matching or replacing emoji via that parser is problematic. Removing \u00c0-\uffff from the highlighted regex accomplishes what I need, but the parser then starts breaking up words. Text in some scripts (Cyrillic, for example) gets broken up letter by letter, which is bad for performance. I need to either patch that regex to allow letters but not emoji, or add a regex that catches all text before it.
Edit: Added some examples
Edit: Added language restriction
Upvotes: 4
Views: 1420
Reputation: 372
In JavaScript before ES2018 (whose Unicode property escapes were supported natively in many browsers by mid-2020), the answer is "roll your own" 😱
Here is what I made, after consulting Wikipedia and using this SO answer to clean up the endless list of Unicode code points:
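const westernEurope = '\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u01BF';
// (\u00D7 and \u00F7 are math symbols, so the ranges skip them)
const cyrillic = '\u0400-\u04FF'; // basic Cyrillic block
const japan = '\u30A0-\u30FF'; // Katakana block only
const chinese = '\u4E00-\u9FA5'; // CJK ideographs
// No 'g' flag: it is redundant with ^...$ and makes re.test() stateful via lastIndex
const re = new RegExp(`^[a-zA-Z${westernEurope + cyrillic + japan + chinese}]*$`);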
You should also consult Wikipedia if you need other languages or want to double-check this (for instance, I only included basic Cyrillic in the Cyrillic range above).
If you can use the latest JavaScript in your project, this answer explains how Unicode property escapes are just what we need.
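As a rough illustration of how such range fragments could serve the question's goal (take the leading letters and stop at an emoji), here is a sketch. The name takeLeadingWord is hypothetical, and only the Western European and Cyrillic fragments are included, so other scripts such as Greek will not match until their ranges are added.

```javascript
// Character-class fragments, as in the answer above.
const westernEurope = '\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u01BF';
const cyrillic = '\u0400-\u04FF';

// Anchored at the start only, so matching stops at the first non-letter.
const leadingLetters = new RegExp(`^[a-zA-Z${westernEurope}${cyrillic}]+`);

// Hypothetical helper: returns the leading run of letters, or '' if none.
function takeLeadingWord(text) {
  const m = leadingLetters.exec(text);
  return m ? m[0] : '';
}

console.log(takeLeadingWord('Россия🇷🇺'));      // "Россия"
console.log(takeLeadingWord('America🇺🇸'));     // "America"
console.log(takeLeadingWord('Ελλάδα🇬🇷'));      // "" (no Greek range was included)
console.log(takeLeadingWord('1\uFE0F\u20E3')); // "" (keycap sequence 1️⃣)
```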
Upvotes: 0
Reputation: 1219
I found a solution here: https://mathiasbynens.be/notes/es-unicode-property-escapes#word
Essentially /[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]/u, given Unicode property escapes support.
Until \p is natively supported in JavaScript, you can transpile this regex.
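A short sketch of this class applied to the question's samples, assuming a runtime with ES2018 Unicode property escapes. The + quantifier and the name firstWord are additions for illustration. One caveat worth checking against the original requirements: \p{Mark} includes the variation selector U+FE0F and the combining keycap U+20E3, and \p{Decimal_Number} includes digits, so a keycap sequence such as 1️⃣ is matched by this class.

```javascript
// The "word character" class from the linked article, repeated one or more times.
const wordChars = /[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]+/u;

// Hypothetical helper: returns the first run of word characters, or null.
function firstWord(text) {
  const m = wordChars.exec(text);
  return m ? m[0] : null;
}

console.log(firstWord('Ελλάδα🇬🇷')); // "Ελλάδα"
console.log(firstWord('Россия🇷🇺')); // "Россия"

// Caveat: the digit, U+FE0F and U+20E3 are all in the class, so the keycap matches.
const keycapOne = '1\uFE0F\u20E3';
console.log(firstWord(keycapOne) === keycapOne); // true
```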
Upvotes: 3
Reputation: 336178
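\p{L} matches a Unicode letter (some regex flavors also accept the short form \pL; JavaScript requires the braces and the u flag).
You might want to combine that Unicode category with \p{Pc} (connector punctuation, such as the underscore) by using a character class: [\p{L}\p{Pc}]
Note that the apostrophe in words like it's or doesn't is not connector punctuation, so add it to the class explicitly if you want to catch such words: [\p{L}\p{Pc}']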
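In JavaScript this could look like the following sketch, assuming ES2018 property escapes (braced form plus the u flag). The constant names are illustrative. The second pattern lists the apostrophe explicitly, because U+0027 has the category Po rather than Pc.

```javascript
// Letters plus connector punctuation (for example the underscore).
const lettersAndConnectors = /[\p{L}\p{Pc}]+/u;
// Same class with the ASCII apostrophe added explicitly.
const withApostrophe = /[\p{L}\p{Pc}']+/u;

console.log(lettersAndConnectors.exec('snake_case')[0]); // "snake_case"
console.log(lettersAndConnectors.exec("doesn't")[0]);    // "doesn"
console.log(withApostrophe.exec("doesn't")[0]);          // "doesn't"
console.log(lettersAndConnectors.exec('1\uFE0F\u20E3')); // null (keycap 1️⃣ has no letters)
```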
Upvotes: 0