Reputation: 353
I am learning regex. One of the problems requires me to find all words that begin with a vowel. I am using Python's re module to evaluate the regular expression.
Here is the regex I made:
\<[aeiouAEIOU].*?\>
The above regex does not work with the \< and \> anchors, but it works with the \b anchor. Why?
Upvotes: 2
Views: 1180
Reputation: 626919
Python re does not support the "leading/starting word boundary" construct \< (in other regex flavors, also \m or [[:<:]]), nor the "closing/trailing word boundary" construct \> (in other regex flavors, also \M or [[:>:]]).
Note that leading and trailing word boundaries are not supported by most NFA regex engines, which are often referred to as "modern" engines. The usual way is to use \b, as you have already noticed, because it is more convenient.
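As a small illustration of the point above, here is a sketch of what happens in Python. The sample sentence and variable names are made up for the demo. In Python's re, a backslash before a non-alphanumeric character just escapes that character, so \< and \> match literal angle brackets rather than word boundaries.

```python
import re

# Hypothetical sample text for the demo.
text = "An <apple> is on every old table"

# In Python's re, \< and \> are escaped literal '<' and '>' characters,
# so this pattern only matches text literally wrapped in angle brackets.
literal_pattern = re.compile(r"\<[aeiouAEIOU].*?\>")
literal_matches = literal_pattern.findall(text)

# \b is the word boundary Python understands; \w* keeps the match inside one word.
boundary_pattern = re.compile(r"\b[aeiouAEIOU]\w*")
boundary_matches = boundary_pattern.findall(text)

print(literal_matches)
print(boundary_matches)
```

Using \w* instead of .*? is a choice made for this sketch: it stops the match at the end of the word without needing a closing anchor.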
However, this convenience comes at a price: \b is a context-dependent pattern. This problem has been covered very broadly on SO; here is my answer covering some aspects of \b: see Word boundary with words starting or ending with special characters gives unexpected results.
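A minimal sketch of that context dependence, using a made-up #tag token: \b only matches between a word char and a non-word char, so whether it matches before # depends on the character that precedes it.

```python
import re

# Hypothetical pattern: a token that starts with a non-word char ('#').
pattern = re.compile(r"\b#tag\b")

# ' ' and '#' are both non-word chars, so there is no boundary before '#'.
after_space = pattern.search("see #tag here")

# 'a' is a word char and '#' is not, so \b matches between them.
after_letter = pattern.search("see a#tag here")

print(after_space)
print(after_letter.group())
```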
So, if you plan to use \< or \>, you need to implement them manually like this:
\< = a position at a word boundary where the char to the right is a word char, i.e. \b(?=\w).
\> = a position at a word boundary where the char to the left is a word char, i.e. \b(?<=\w).
This is how these word boundary variants are handled in the PCRE library:
COMPATIBILITY FEATURE FOR WORD BOUNDARIES
In the POSIX.2 compliant library that was included in 4.4BSD Unix, the ugly syntax [[:<:]] and [[:>:]] is used for matching "start of word" and "end of word". PCRE treats these items as follows:
[[:<:]] is converted to \b(?=\w)
[[:>:]] is converted to \b(?<=\w)
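The two conversions can be tried directly in Python's re. The following is only a sketch: the names WORD_START, WORD_END and vowel_words are hypothetical helpers introduced here, and \w* stands in for the lazy .*? of the question so that each match stays within a single word.

```python
import re

# Hand-written equivalents of \< and \> (names are hypothetical).
WORD_START = r"\b(?=\w)"   # boundary with a word char to the right
WORD_END = r"\b(?<=\w)"    # boundary with a word char to the left


def vowel_words(text):
    """Return all words in text that begin with a vowel."""
    pattern = WORD_START + r"[aeiouAEIOU]\w*" + WORD_END
    return re.findall(pattern, text)


# Positions where each emulated anchor matches in a short sample string.
sample = "hi, you"
start_positions = [m.start() for m in re.finditer(WORD_START, sample)]
end_positions = [m.start() for m in re.finditer(WORD_END, sample)]

words = vowel_words("Every apple is under an old tree")
print(start_positions, end_positions, words)
```

Plain \b would match at all four of those positions (0, 2, 4 and 7), while each emulated anchor matches only at its own side of a word.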
Upvotes: 1
Reputation: 189467
"Does not work" is not correct; one works in some regex dialects, the other in others.
Most "modern" regex dialects (Python, Perl, Ruby, etc.) use \b as the word boundary, on both sides.
More traditional regex dialects, like the original egrep, use \< as the left word boundary operator, and \> on the right.
(Strictly speaking, Al Aho's original egrep did not have word boundaries; this feature was added later. Maybe see https://stackoverflow.com/a/39367415/874188 for a one-minute summary of regex history.)
Upvotes: 2