Paul de Lange

Reputation: 10633

Regex explained in English

I have looked here and, from what I understand, the following regex simply means "any Unicode character sequence". Can someone confirm this, please?

Current Regex: /^(?>\P{M}\p{M}*)+$/u

Also, the manual says:

a) \P{M} = \PM

b) (?>\PM\pM*) = \X

So, with these two facts in hand, can I not simplify the regex to the following?

Proposed Regex: /^\X+$/u

Which I still don't actually understand...

Upvotes: 1

Views: 212

Answers (2)

Bart Kiers

Reputation: 170288

Yes, (?>\P{M}\p{M}*) could be simplified to \X, but not all languages support \X, while (in my experience) \P{M} and \p{M} are supported more frequently.

For example, .NET's regex engine does not support \X, and Java's only gained it in Java 9 (Perl does, of course...).

For more info, see: http://www.regular-expressions.info/unicode.html
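To see what "not supported" looks like in practice, here is a minimal sketch that probes Python's built-in `re` module, which (as far as I know) accepts neither \X nor \p{M}; the third-party `regex` package is the usual way to get both. The helper name `engine_accepts` is made up for this illustration.

```python
import re


def engine_accepts(pattern: str) -> bool:
    # Hypothetical helper: True if Python's built-in `re` module
    # can compile the pattern, False if it rejects the syntax.
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Print which of the candidate patterns this engine understands.
for candidate in (r"\X+", r"\P{M}\p{M}*", r"\w+"):
    print(candidate, engine_accepts(candidate))
```

The same kind of probe applies to any engine: check whether the pattern compiles before relying on either spelling.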

Upvotes: 2

beerbajay

Reputation: 20300

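^            # start of string, followed by
(?>          # an atomic (non-backtracking, non-capturing) group containing
    \P{M}    # a single Unicode character that is not in the `Mark` category
    \p{M}*   # 0 or more characters in the `Mark` category
)+           # with this group repeated 1 or more times
$            # the end of the string (or just before a final newline)

In other words, the string must be non-empty and made up of base characters, each optionally followed by combining marks, so it cannot start with a combining mark. Note that neither pattern contains a capturing group: (?>...) is atomic and does not capture, and ^\X+$ has no group at all. Per the manual quoted in the question, \X is otherwise equivalent to (?>\P{M}\p{M}*).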
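As a rough illustration of what the group consumes on each repetition, here is a minimal Python sketch. It does not use a regex at all, because Python's built-in `re` module has no \p{M}; it assumes that `unicodedata` categories starting with "M" correspond to the `Mark` category. The names `is_mark` and `split_clusters` are made up for this example.

```python
import unicodedata
from typing import List, Optional


def is_mark(ch: str) -> bool:
    # Mn, Mc and Me together form the Unicode "Mark" (M) category.
    return unicodedata.category(ch).startswith("M")


def split_clusters(text: str) -> Optional[List[str]]:
    # Returns the list of "base + marks" clusters, or None where the
    # pattern would fail to match (empty input or a leading mark).
    if not text or is_mark(text[0]):
        return None
    clusters: List[str] = []
    for ch in text:
        if is_mark(ch):
            clusters[-1] += ch      # \p{M}* : attach the mark to its base
        else:
            clusters.append(ch)     # \P{M}  : start a new base character
    return clusters
```

For example, `split_clusters("e\u0301a")` yields two clusters ("e" plus the combining acute accent, then "a"), while `split_clusters("\u0301e")` returns None because the string starts with a combining mark.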

Upvotes: 2
