Reputation: 11
I understand that \b can represent either the beginning or the end of a word. When would \b be required to represent the end? I'm asking because it seems that it's always necessary to have \s to indicate the end of the word, therefore eliminating the need to have \b. Like in the case below, one with a '\b' to end the inner group, the other without, and they get the same result.
import re
m = re.search(r'(\b\w+\b)\s+\1', 'Cherry tree blooming will begin in in later March')
print(m.group())
m = re.search(r'(\b\w+)\s+\1', 'Cherry tree blooming will begin in in later March')
print(m.group())
Upvotes: 0
Views: 459
Reputation: 41872
"I understand that \b can represent either the beginning or the end of a word. When would \b be required to represent the end?"
\b is never required to represent the end, or beginning, of a word. To answer your bigger question, it's only useful during development: when working with natural language, you'll ultimately need to replace \b with something else. Why?
The \b operator matches a word boundary, as you've discovered. But a key concept here is, "What is a word?" The answer is the very narrow set [A-Za-z0-9_] (plus Unicode letters and digits for str patterns in Python 3, unless re.ASCII is set). A word here is not a natural language word but a computer language identifier. The \b operator exists for a formal language's parser.
This means it doesn't handle common natural language situations like:
The word let's becomes two words, 'let' and 's', if \b represents the boundaries of a word. Titles like Mr. and Mrs. also lose their period.
Similarly, if \b represents the start of a word, the apostrophe is lost in cases like 'twas, 'bout, and 'cause.
Hyphenated words suffer at the hand of \b as well, e.g. mother-in-law (unless you want her to suffer).
Unfortunately, you can't simply augment \b by including it in a character set, as it doesn't represent a character. You may be able to combine it with other characters via alternation in a zero-width assertion.
When working with natural language, the \b operator is great for quickly prototyping an idea, but ultimately it's probably not what you want. Ditto \w, but since it represents a character, it's more easily augmented.
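As a rough illustration of the points above, the sketch below shows how \b-delimited tokens split contractions and hyphenated words, and how a character class built around \w can be widened instead. The sample sentence and the choice of extra characters (apostrophe and hyphen) are assumptions made for the example, not a complete natural-language tokenizer.

```python
import re

text = "let's ask my mother-in-law"  # hypothetical sample sentence

# \b treats the apostrophe and the hyphens as boundaries, so words get split.
split_tokens = re.findall(r"\b\w+\b", text)
print(split_tokens)
# ['let', 's', 'ask', 'my', 'mother', 'in', 'law']

# \w stands for a character, so it can be widened inside [...].
# Assumption: apostrophes and hyphens are the only extra word characters.
widened_tokens = re.findall(r"[\w'-]+", text)
print(widened_tokens)
# ["let's", 'ask', 'my', 'mother-in-law']
```

A class like this will also accept stray hyphens or quotes on their own, so real text would likely need further tuning.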
Upvotes: 0
Reputation: 75232
It's not because it's at the end of the word, it's because you know what comes after the word. In your example:
m = re.search(r'(\b\w+\b)\s+\1', 'Cherry tree blooming will begin in in later March')
...the first \b is necessary to prevent a match starting with the "in" in "begin". The second one is redundant because you're explicitly matching the non-word characters (\s+) that follow the word. Word boundaries are for situations where you don't know what the character on the other side will be, or even if there will be a character there.
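To see what the leading \b changes, it may help to compare where each match starts. The sketch below reuses the sentence from the question; the variable names are only for illustration. Both patterns return the text 'in in', but without the leading \b the match begins inside "begin".

```python
import re

text = 'Cherry tree blooming will begin in in later March'

loose = re.search(r'(\w+)\s+\1', text)     # no leading \b
strict = re.search(r'(\b\w+)\s+\1', text)  # leading \b

# Both matches read 'in in', but they start at different offsets.
print(loose.group(), loose.span())
print(strict.group(), strict.span())

# Show three extra characters of context before the loose match.
context = text[loose.start() - 3:loose.end()]
print(context)  # begin in
```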
Where you should be using another one is at the end of the regex. For example:
m = re.search(r'(\b\w+)\s+\1\b', "Let's go to the theater")
Without the second \b, you would get a false positive for "the theater".
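A minimal check of that claim, using the same sentence (the variable names are illustrative):

```python
import re

text = "Let's go to the theater"

# No trailing \b: the backreference is satisfied by the start of "theater".
without_b = re.search(r'(\b\w+)\s+\1', text)
print(without_b.group())  # the the

# Trailing \b: the repeated text must end at a word boundary, so nothing matches.
with_b = re.search(r'(\b\w+)\s+\1\b', text)
print(with_b)  # None
```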
Upvotes: 1
Reputation: 309929
Consider wanting to match the word "march":
>>> regex = re.compile(r'\bmarch\b')
It can come at the end of the sentence...
>>> regex.search('I love march')
<_sre.SRE_Match object at 0x10568e4a8>
Or the beginning ...
>>> regex.search('march is a great month')
<_sre.SRE_Match object at 0x10568e440>
But if I don't want to match things like marching, word boundaries are the most convenient:
>>> regex.search('my favorite pass-time is marching')
>>>
You might be thinking "But I can get all of these things using r'\s+march\s+'" and you're kind of right... The difference is in what matches. With the \s+, you also might be including some whitespace in the match (since that's what \s+ means). This can make certain things, like searching for a word and replacing it, more difficult, because you might have to manage keeping the whitespace consistent with what it was before.
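A small sketch of that difference with re.sub; the sample sentences and the replacement word are made up for the example:

```python
import re

text = 'I love march weather'

# \b is zero-width, so only the word itself is replaced.
boundary_result = re.sub(r'\bmarch\b', 'April', text)
print(boundary_result)  # I love April weather

# \s+ consumes the surrounding spaces, so the replacement swallows them.
space_result = re.sub(r'\s+march\s+', 'April', text)
print(space_result)  # I loveAprilweather

# \s+ also requires whitespace to be present, so a word at the very end is missed.
end_result = re.search(r'\s+march\s+', 'I love march')
print(end_result)  # None
```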
Upvotes: 2
Reputation: 47790
\s is just whitespace. You can have word boundaries that aren't whitespace (punctuation, etc.), which is when you need to use \b. If you're only matching words that are delimited by whitespace, then you can just use \s, and in that case you don't need the \b.
import re
sentence = 'Non-whitespace delimiters: Commas, semicolons; etc.'
print(re.findall(r'(\b\w+)\s+', sentence))
print(re.findall(r'(\b\w+\b)', sentence))
Produces:
['whitespace']
['Non', 'whitespace', 'delimiters', 'Commas', 'semicolons', 'etc']
Notice how trying to catch word endings with just \s ends up missing most of them.
Upvotes: 2