Max

Reputation: 1219

Regex matching letters – including non-latin, but excluding emoji

I need a JavaScript regex that matches words in any language but fails on emoji and any other non-letter character. The solution in "Regular expression to match non-English characters?", ([^\u0000-\u007F]+), matches non-ASCII letters but also pictograms and emoji.

Modifying it a bit seems to accomplish what I need, but I'm not sure how safe it is: ([a-zA-Z]|[^\u0000-\u007F\u200d-\u3299\ud83c-\udfff\ufe0e\ufe0f])+

Example: America🇺🇸 Österreich🇦🇹 Россия🇷🇺 Ελλάδα🇬🇷

It should match only the letters and stop before the emoji. It should also not match emoji that are built from ASCII characters (keycap sequences), for example: 1️⃣#️⃣*️⃣
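One way to check how safe the modified pattern is would be a small script like the one below. It is only a sketch: the helper name `leadingMatch` and the extra kana sample are assumptions added for illustration, and the pattern is anchored at the start so that only the leading run is reported.

```javascript
// The modified pattern from above, anchored to the start of the input.
const candidate = /^(?:[a-zA-Z]|[^\u0000-\u007F\u200d-\u3299\ud83c-\udfff\ufe0e\ufe0f])+/;

// Hypothetical helper: returns the leading run the pattern accepts, or ''.
function leadingMatch(text) {
  const m = candidate.exec(text);
  return m ? m[0] : '';
}

const samples = [
  'America🇺🇸',
  'Россия🇷🇺',
  'Ελλάδα🇬🇷',
  '1\uFE0F\u20E3', // keycap 1️⃣ written with escapes
  'ひらがな',       // extra sample (not in the original examples)
];

for (const s of samples) {
  console.log(JSON.stringify(s), '->', JSON.stringify(leadingMatch(s)));
}
```

With these inputs the words come back without their flags and the keycap sequence yields an empty string. The kana sample also comes back empty, because Hiragana and Katakana (U+3040 to U+30FF) sit inside the excluded \u200d-\u3299 range, so the pattern is not safe for Japanese text.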

Relevant: http://www.unicode.org/Public/emoji/5.0/emoji-variation-sequences.txt

A bit of context: I'm trying to patch this parser: https://github.com/Khan/simple-markdown/blob/master/simple-markdown.js#L1304 so that it breaks on emoji, because it currently matches as much text as it can. Without that, matching and replacing emoji through the parser is problematic. Removing \u00c0-\uffff from the highlighted regex accomplishes what I need, but then the parser starts breaking up words. Text in some scripts (for example Cyrillic) gets split letter by letter, which is bad for performance. I need to either patch that regex to allow letters but not emoji, or put a regex before it that catches all text.

Edit: Added some examples

Edit: Added language restriction

Upvotes: 4

Views: 1420

Answers (3)

Mad Bernard

Reputation: 372

In JavaScript without ES2018 Unicode property escapes (which many browsers only supported natively by mid-2020), the answer is "roll your own" 😱

Here is what I made, after consulting Wikipedia and using this SO answer to clean up the endless list of Unicode code points:

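const westernEurope = '\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u01BF';
// (\u00D7 and \u00F7 are the math symbols × and ÷, so they are skipped)
const cyrillic = '\u0400-\u04FF';
const japan = '\u30A0-\u30FF'; // Katakana block only
const chinese = '\u4E00-\u9FA5';

// No 'g' flag: a global regex keeps lastIndex between test() calls,
// which makes an anchored whole-string check fail on alternate calls.
const re = new RegExp(`^[a-zA-Z${westernEurope + cyrillic + japan + chinese}]*$`);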

You should also consult Wikipedia if you need other languages or want to double check this (for instance, I only included basic Cyrillic in the cyrillic codes above)
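As a rough illustration of how hand-written ranges like these could be used to stop before an emoji, here is a minimal sketch. The names `onlyLetters`, `leadingLetters` and `splitAtFirstNonLetter` are made up for this example, and only the Latin and Cyrillic ranges from above are included.

```javascript
const westernEurope = '\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u01BF';
const cyrillic = '\u0400-\u04FF';
const letters = `a-zA-Z${westernEurope}${cyrillic}`;

// Whole-string check: is the input made of these letters only?
const onlyLetters = new RegExp(`^[${letters}]+$`);
// Prefix match: same character class, anchored at the start only.
const leadingLetters = new RegExp(`^[${letters}]+`);

// Hypothetical helper: splits the text into [leading letters, rest].
function splitAtFirstNonLetter(text) {
  const m = leadingLetters.exec(text);
  const head = m ? m[0] : '';
  return [head, text.slice(head.length)];
}

console.log(onlyLetters.test('Россия'));         // true
console.log(onlyLetters.test('Ελλάδα'));         // false: no Greek range was listed
console.log(splitAtFirstNonLetter('Россия🇷🇺')); // [ 'Россия', '🇷🇺' ]
```

The Greek result shows the main cost of this approach: every script has to be added by hand, and anything that is not listed is treated as a non-letter.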

If your project can target ES2018 or later, this answer explains how Unicode property escapes (\p{...} with the u flag) are just what we need.

Upvotes: 0

Max

Reputation: 1219

I found a solution here: https://mathiasbynens.be/notes/es-unicode-property-escapes#word

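Essentially /[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]/u, provided the engine supports Unicode property escapes. This is the Unicode-aware equivalent of \w, so it also accepts digits, underscores and combining marks.

In environments without native \p{...} support, you can transpile this regex (for example with regexpu).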
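To see how this class behaves on the examples from the question, a small comparison may help. This is a sketch only: the stricter `lettersOnly` pattern and the `prefix` helper are assumptions added here for illustration and do not come from the linked article.

```javascript
// Word-character class quoted above (a Unicode-aware \w), anchored at the start.
const wordChars = /^[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]+/u;

// Assumed stricter variant: letters, each optionally followed by combining marks.
const lettersOnly = /^(?:\p{L}\p{M}*)+/u;

// Hypothetical helper: returns the leading run matched by `re`, or ''.
function prefix(re, text) {
  const m = re.exec(text);
  return m ? m[0] : '';
}

const keycapOne = '1\uFE0F\u20E3'; // the 1️⃣ sequence: digit, variation selector, keycap mark

console.log(prefix(lettersOnly, 'Россия🇷🇺'));          // Россия
console.log(prefix(lettersOnly, keycapOne) === '');       // true
console.log(prefix(wordChars, keycapOne) === keycapOne);  // true
```

The last line is the caveat: the digit is a Decimal_Number and both U+FE0F and U+20E3 are Marks, so the \w-style class accepts the whole keycap sequence. If keycaps must be rejected, a letters-plus-marks pattern along the lines of `lettersOnly` is closer to the requirement.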

Upvotes: 3

Tim Pietzcker

Reputation: 336178

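\pL matches a Unicode letter. In JavaScript the braces are required and the regex needs the u flag, so write /\p{L}/u (/\pL/u is a syntax error).

You might want to combine that Unicode category with \p{Pc} (connector punctuation) to keep identifiers like snake_case together by using a character class: [\p{L}\p{Pc}]. Note that the apostrophe in it's or doesn't is not connector punctuation, so add it to the class explicitly if you need such words to match as a whole.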
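A short sketch of what such a class accepts in JavaScript. The variable names are made up for this example, and adding ' and U+2019 to the class is an assumption about which apostrophes matter.

```javascript
// Letters plus connector punctuation (the underscore is in \p{Pc}).
const lettersAndConnectors = /[\p{L}\p{Pc}]+/gu;

// Same class with the ASCII apostrophe and U+2019 added explicitly.
const withApostrophes = /[\p{L}\p{Pc}'\u2019]+/gu;

console.log('snake_case'.match(lettersAndConnectors)); // [ 'snake_case' ]
console.log("doesn't".match(lettersAndConnectors));    // [ 'doesn', 't' ]
console.log("doesn't".match(withApostrophes));         // [ "doesn't" ]
```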

Upvotes: 0
