Reputation: 59252
This may be a theoretical question.
Why does the underscore _ come under \w in regex and not under \W?
I hope this isn't primarily opinion based, because there should be a reason.
Citation would be great, if at all available.
Upvotes: 8
Views: 700
Reputation: 71538
In a Unicode-aware engine such as Perl's, \w matches any single code point that has any of the following properties:
\p{Alphabetic} (letters and some other Unicode code points; note this is a binary property, not a General_Category value)
\p{GC=Mark} (marks: spacing, non-spacing, enclosing)
\p{GC=Connector_Punctuation} (e.g. underscore)
\p{GC=Decimal_Number} (decimal digits, including those of non-Latin scripts)
\p{Join_Control} (code points U+200C and U+200D)
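To make the categories above concrete, here is a small Python sketch that looks up the General_Category of a few sample characters with the standard `unicodedata` module. The sample characters and the helper name `describe` are my own choices for illustration. Note that Python's `re` module does not implement exactly this definition of \w, so the sketch only inspects the Unicode properties themselves rather than regex behaviour.

```python
import unicodedata

# Sample code points chosen for illustration (not from the original answer).
SAMPLES = ["a", "_", "\u203f", "7", "\u0301", "\u200c"]


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME: General_Category' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}: {unicodedata.category(ch)}"


if __name__ == "__main__":
    for ch in SAMPLES:
        print(describe(ch))
```

The underscore (LOW LINE) reports `Pc`, i.e. Connector_Punctuation, the same category as U+203F UNDERTIE; that category is what pulls it into \w under the definition above.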
These properties are used in the composition of programming language identifiers in scripts. For instance[1]:
The Connector Punctuation (\p{GC=Connector_Punctuation}) is added in for programming language identifiers, thus adding "_" and similar characters.
There is a[2]:
general intent that an identifier consists of a string of characters beginning with a letter or an ideograph, and followed by any number of letters, ideographs, digits, or underscores.
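As a rough illustration of that intent, an identifier of this shape can be written with \w in Python's `re` module. This is only a sketch: the name `is_identifier` is hypothetical, and Python's \w is "alphanumeric or underscore" rather than the exact property list given earlier, but the idea carries over.

```python
import re

# [^\W\d_] means "a word character that is neither a digit nor an underscore",
# which is a compact way to say "a letter or ideograph".
# \w* then allows any number of letters, ideographs, digits, or underscores.
IDENTIFIER = re.compile(r"[^\W\d_]\w*")


def is_identifier(text: str) -> bool:
    """Return True if the whole string has the identifier shape described above."""
    return IDENTIFIER.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ["total_count2", "2nd", "has-hyphen", "\u53d8\u91cf_1"]:
        print(repr(candidate), is_identifier(candidate))
```

Because the underscore is already inside \w, the "rest of the identifier" part needs no extra character class; that convenience is the practical reason for grouping it with letters and digits.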
\p{Join_Control} was itself added to the character class \w fairly recently, and here's a message that the Perl developers exchanged about its implementation, supporting my earlier point that \w is used to compose identifiers.
Upvotes: 2
Reputation: 141810
From Wikipedia's Regular expression article (emphasis mine):
An additional non-POSIX class understood by some tools is [:word:], which is usually defined as [:alnum:] plus underscore. This reflects the fact that in many programming languages these are the characters that may be used in identifiers.
In perl, tcl and vim, this non-standard class is represented by \w (and characters outside this class are represented by \W).
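A quick way to check the "[:alnum:] plus underscore" description is to enumerate what \w accepts in the ASCII range. The sketch below uses Python's `re` with the `re.ASCII` flag, which restricts \w to ASCII; the function name `ascii_word_chars` is made up for this example.

```python
import re
import string

WORD = re.compile(r"\w", re.ASCII)
NON_WORD = re.compile(r"\W", re.ASCII)


def ascii_word_chars() -> set:
    """Collect every ASCII character that \\w matches in ASCII mode."""
    return {chr(code) for code in range(128) if WORD.fullmatch(chr(code))}


if __name__ == "__main__":
    alnum_plus_underscore = set(string.ascii_letters + string.digits + "_")
    # The two sets coincide: \w is letters, digits, and the underscore.
    print(ascii_word_chars() == alnum_plus_underscore)
    # The underscore is a word character; the hyphen is not.
    print(WORD.fullmatch("_") is not None, NON_WORD.fullmatch("-") is not None)
```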
Upvotes: 8