Exn

Reputation: 809

What's a good regex to include accented characters in a simple way?

Right now my regex is something like this:

[a-zA-Z0-9], but it does not match accented characters, which I need. I would also like the characters - ' , to be included.

Upvotes: 55

Views: 61162

Answers (5)

chichilatte

Reputation: 1808

@NightCoder's answer works perfectly in PHP:

    \p{L}\p{M}

and with no brittle whitelists. Note that in JavaScript you need to add the Unicode u flag for this to work. Here is a working example in JavaScript:

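const text = `Crêpes are øh-so déclassée`;
// The u flag enables \p{...} property escapes; matchAll requires the g flag.
const matches = [...text.matchAll(/[-'’\p{L}\p{M}\p{N}]+/giu)];

The resulting array looks something like this (other match properties omitted):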

[
    {
        "0": "Crêpes",
        "index": 0
    },
    {
        "0": "are",
        "index": 7
    },
    {
        "0": "øh-so",
        "index": 11
    },
    {
        "0": "déclassée",
        "index": 17
    }
]

Here it is in a playground... https://regex101.com/r/ifgH4H/1/

And also some detail on those regex unicode categories... https://javascript.info/regexp-unicode

Upvotes: 5

NightCoder

Reputation: 1139

Add the following to your character class:

\p{L}\p{M}

In a Unicode-aware regex engine, these match:

  • any letter character (L) from any language
  • any mark (M), i.e. a character that is meant to be combined with another one (accents, etc.)
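
For what it's worth, here is a minimal sketch of how those two categories could be dropped into the asker's character class in JavaScript. It assumes an engine that supports Unicode property escapes (ES2018 or later) and keeps the 0-9, dash, apostrophe and comma from the question. The names namePattern and isValidName are made up for this example.

```javascript
// Assumption: the engine supports \p{...} escapes, which require the u flag.
// \p{L} = any letter, \p{M} = any combining mark; 0-9 ' , - come from the question.
const namePattern = /^[\p{L}\p{M}0-9',-]+$/u;

// Hypothetical helper: true when every character of the input is allowed.
function isValidName(input) {
  return namePattern.test(input);
}

console.log(isValidName("Crêpes"));        // true: ê is a letter (L)
console.log(isValidName("Cre\u0302pes"));  // true: e followed by a combining circumflex (M)
console.log(isValidName("O'Neil-Smith"));  // true: apostrophe and dash are in the class
console.log(isValidName("hello world"));   // false: the space is not in the class
```

The \p{M} part matters when the text is not normalized: an accent can arrive as a separate combining character, and \p{L} on its own would reject it.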

Upvotes: 32

just.jules

Reputation: 99

A version without the exclusion rules:

^[-'a-zA-ZÀ-ÖØ-öø-ÿ]+$

Explanation

  • The ^ anchor asserts that we are at the beginning of the string
  • [...] allows dash, apostrophe, letters, and chars in a wide accented range (note that digits are not included)
  • The + matches that one or more times
  • The $ anchor asserts that we are at the end of the string
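
A quick way to see what this range-based pattern accepts, sketched in JavaScript. The variable name latin1Name is made up for the example, and it assumes the input uses precomposed accented characters.

```javascript
// The three accented ranges cover U+00C0 to U+00FF, skipping × (U+00D7) and ÷ (U+00F7).
const latin1Name = /^[-'a-zA-ZÀ-ÖØ-öø-ÿ]+$/;

console.log(latin1Name.test("déclassée")); // true: é is inside Ø-ö
console.log(latin1Name.test("Jean-Luc"));  // true: the dash is allowed
console.log(latin1Name.test("abc123"));    // false: digits are not in the class
console.log(latin1Name.test("Łukasz"));    // false: Ł (U+0141) lies beyond ÿ
```

So this works for Western European text, but letters outside the Latin-1 block are rejected.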


Upvotes: 6

zx81

Reputation: 41838

Accented Characters: DIY Character Range Subtraction

If your regex engine allows it (and many will), this will work:

(?i)^(?:(?![×Þß÷þø])[-'0-9a-zÀ-ÿ])+$

Please see the demo (you can add characters to test).

Explanation

  • (?i) sets case-insensitive mode
  • The ^ anchor asserts that we are at the beginning of the string
  • (?:(?![×Þß÷þø])[-'0-9a-zÀ-ÿ]) matches one character...
  • The lookahead (?![×Þß÷þø]) asserts that the char is not one of those in the brackets
  • [-'0-9a-zÀ-ÿ] allows dash, apostrophe, digits, letters, and chars in a wide accented range, from which the lookahead subtracts the unwanted ones
  • The + matches that one or more times
  • The $ anchor asserts that we are at the end of the string
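
Here is a rough JavaScript port of the same idea, as a sketch only. JavaScript does not accept a leading (?i), so the i flag is used instead; the variable name subtracted is made up for the example.

```javascript
// Same pattern as above, with the i flag replacing the inline (?i) modifier.
// The negative lookahead runs before each character and vetoes the listed ones.
const subtracted = /^(?:(?![×Þß÷þø])[-'0-9a-zÀ-ÿ])+$/i;

console.log(subtracted.test("d'Artagnan")); // true: apostrophe and letters are allowed
console.log(subtracted.test("façade"));     // true: ç is inside À-ÿ and not excluded
console.log(subtracted.test("2×3"));        // false: × is inside À-ÿ but vetoed by the lookahead
console.log(subtracted.test("Ørsted"));     // false: ø is on the exclusion list, and i extends that to Ø
```

The last case shows that the exclusion list also drops a few real letters, so adjust the characters inside the lookahead to suit your data.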

Reference

Extended ASCII Table

Upvotes: 43

Brian Stephens

Reputation: 5261

Use a POSIX character class (http://www.regular-expressions.info/posixbrackets.html):

[-'[:alpha:]0-9] or [-'[:alnum:]]

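The [:alpha:] character class matches whatever is considered "alphabetic characters" in your locale. Note that POSIX bracket expressions are not available in every regex flavor; JavaScript, for example, does not support them.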

Upvotes: 4
