Amit Joki

Reputation: 59252

Why does underscore come under \w?

This may be a theoretical question.

Why does the underscore _ come under \w in regex and not under \W?

I hope this isn't primarily opinion based, because there should be a reason.

Citation would be great, if at all available.

Upvotes: 8

Views: 700

Answers (2)

Jerry

Reputation: 71538

\w matches any single code point that has any of the following properties:

  • \p{Alphabetic} (letters and some other Unicode code points; Alphabetic is a binary property, not a General_Category value)

  • \p{GC=Mark} (Mark: Spacing, non-spacing, enclosing)

  • \p{GC=Connector_Punctuation} (e.g. underscore)

  • \p{GC=Decimal_Number} (decimal digits, including those of non-Latin scripts)

  • \p{Join_Control} (code points U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER)

These properties are used in the composition of programming language identifiers in scripts. For instance[1]:

The Connector Punctuation (\p{GC=Connector_Punctuation}) is added in for programming language identifiers, thus adding "_" and similar characters.
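As a rough illustration of the property list above, here is a minimal Python sketch. The function name is_word_char is hypothetical, and because Python's standard library does not expose the Alphabetic property, the letter categories (L*) plus Nl are used as an approximation of it. This is not how Perl or Python's own re module implement \w.

```python
import unicodedata

# U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER
JOIN_CONTROL = {"\u200c", "\u200d"}


def is_word_char(ch: str) -> bool:
    """Approximate the property-based definition of \\w for one code point."""
    cat = unicodedata.category(ch)
    return (
        cat.startswith("L") or cat == "Nl"  # assumption: stand-in for Alphabetic
        or cat.startswith("M")              # Mark (Mn, Mc, Me)
        or cat == "Pc"                      # Connector_Punctuation, e.g. "_"
        or cat == "Nd"                      # Decimal_Number
        or ch in JOIN_CONTROL               # Join_Control
    )


underscore_category = unicodedata.category("_")  # "Pc"
```

The underscore qualifies only through the Connector_Punctuation branch, while a hyphen (category Pd) falls outside every branch.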

There is a[2]:

general intent that an identifier consists of a string of characters beginning with a letter or an ideograph, and followed by any number of letters, ideographs, digits, or underscores.
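The identifier shape described in that quote can be sketched with \w itself. The pattern below is a common Python idiom, not something taken from the cited source: [^\W\d_] means "a word character that is neither a digit nor an underscore", which only approximates "a letter or an ideograph". The names IDENTIFIER and looks_like_identifier are made up for this example.

```python
import re

# First character: a word character that is not a digit or underscore
# (an approximation of "letter or ideograph").
# Remaining characters: any word characters (letters, digits, underscore).
IDENTIFIER = re.compile(r"[^\W\d_]\w*")


def looks_like_identifier(text: str) -> bool:
    """Return True if the whole string fits the identifier shape above."""
    return IDENTIFIER.fullmatch(text) is not None


checks = {
    "foo_bar1": looks_like_identifier("foo_bar1"),  # True
    "1foo": looks_like_identifier("1foo"),          # False: starts with a digit
    "foo-bar": looks_like_identifier("foo-bar"),    # False: "-" is not a word character
}
```

Because the underscore is part of \w, the tail of the identifier needs no special case for it.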

\p{Join_Control} was added to the character class \w only recently, and here's a message that Perl developers exchanged about its implementation, which supports my earlier point that \w is meant to match the characters that make up identifiers.

Upvotes: 2

johnsyweb

Reputation: 141810

From Wikipedia's Regular expression article (emphasis mine):

An additional non-POSIX class understood by some tools is [:word:], which is usually defined as [:alnum:] plus underscore. This reflects the fact that in many programming languages these are the characters that may be used in identifiers.

In many regex flavours, this non-standard class is represented by \w (and characters outside this class are represented by \W).
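A quick way to see this behaviour is with Python's re module, used here only as one example flavour. The sample string is made up for the illustration.

```python
import re

sample = "foo_bar baz-qux"

# \w+ keeps "foo_bar" together because "_" is a word character,
# but splits "baz-qux" because "-" is not.
words = re.findall(r"\w+", sample)      # ['foo_bar', 'baz', 'qux']

# \W matches only the characters outside the class.
non_words = re.findall(r"\W", sample)   # [' ', '-']

# With re.ASCII, \w is equivalent to [a-zA-Z0-9_].
ascii_underscore = re.fullmatch(r"\w", "_", re.ASCII) is not None  # True
```

The underscore never shows up in the \W results, which matches the "[:alnum:] plus underscore" description.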

Upvotes: 8
