Reputation: 407
Why does this regular expression:
r'^(?P<first_init>\b\w\.\b)\s(?P<mid_init>\b\w\.\b)\s(?P<last_name>\b\w+\b)$'
not match J. F. Kennedy?
I have to remove \b in the groups first_init and mid_init to match the words.
I am using Python, and for testing I am using https://regex101.com/
Thanks
Upvotes: 1
Views: 1470
Reputation: 43169
Just remove the second boundary in each of the two initial groups:
^(?P<first_init>\b\w\.)\s
(?P<mid_init>\b\w\.)\s
(?P<last_name>\b\w+\b)$
And see a demo on regex101.com.
The background is that the second \b sits between a dot and a space, so it fails (remember that exactly one of the two sides needs to be a word character, i.e. one of a-zA-Z0-9_).
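As a rough check in Python, the multi-line pattern above can be compiled with re.VERBOSE, which ignores literal whitespace in the pattern so that it can span several lines. This is only a sketch: the variable names are illustrative, and the use of verbose mode is an assumption about how the pattern was meant to be laid out.

```python
import re

# Original pattern from the question: \b after each \. can never match here.
original = re.compile(
    r'^(?P<first_init>\b\w\.\b)\s(?P<mid_init>\b\w\.\b)\s(?P<last_name>\b\w+\b)$'
)

# Pattern with the trailing \b removed from both initial groups.
# re.VERBOSE ignores the line breaks and indentation inside the pattern;
# the separators are still matched because they are written as \s.
fixed = re.compile(r"""
    ^(?P<first_init>\b\w\.)\s
    (?P<mid_init>\b\w\.)\s
    (?P<last_name>\b\w+\b)$
""", re.VERBOSE)

name = "J. F. Kennedy"
original_match = original.match(name)   # None
fixed_match = fixed.match(name)         # a match object

print(original_match)
print(fixed_match.groupdict())
# {'first_init': 'J.', 'mid_init': 'F.', 'last_name': 'Kennedy'}
```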
Upvotes: 1
Reputation: 1341
It does not match because of the \. (dot) character. A word boundary does not treat the dot as part of a word (it is not the same definition of "word" you would perhaps like). You can easily rewrite the pattern without needing \b at all. Read the documentation carefully.
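One possible rewrite without \b, sketched in Python: since ^, \s and $ already pin down where each group starts and ends, the boundaries in the original pattern add nothing for this input shape. The name no_boundary is just an illustrative choice.

```python
import re

# Same three named groups as in the question, with every \b dropped.
# ^ and $ anchor the whole string, and \s separates the parts, so each
# \w or \w+ run already starts and ends at the edge of a word.
no_boundary = re.compile(
    r'^(?P<first_init>\w\.)\s(?P<mid_init>\w\.)\s(?P<last_name>\w+)$'
)

result = no_boundary.match('J. F. Kennedy')
print(result.group('first_init'), result.group('mid_init'), result.group('last_name'))
# J. F. Kennedy

# A name without the dots is still rejected.
rejected = no_boundary.match('J F Kennedy')
print(rejected)
# None
```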
Upvotes: 1
Reputation: 506
\b means the border of a word. A word here is defined like so: a word ends when there is a space character following it. "J.", "F." and "Kennedy" are the words here.
Your example is trying to search for a space between the letter and the dot, so it is searching for J . F . Kennedy.
Upvotes: 0
Reputation: 11591
\b matches the empty string only at the beginning or end of a word. A word is a sequence of alphanumeric or underscore characters. The dot (.) cannot be part of the word.
>>> import re
>>> # does not match (returns None) when \. sits inside the word boundaries
>>> re.match(r'^(?P<first_init>\b\w\.\b)\s(?P<mid_init>\b\w\.\b)\s(?P<last_name>\b\w+\b)$', 'J. F. Kennedy')
>>> # matches when the second \b is moved to the left of \.
>>> re.match(r'^(?P<first_init>\b\w\b\.)\s(?P<mid_init>\b\w\b\.)\s(?P<last_name>\b\w+\b)$', 'J. F. Kennedy')
The . is not part of the word in this sense. See the docs here.
Upvotes: 1
Reputation: 22457
You are over-applying the \b word breaks.
\b will only match if there is a valid "word" character on one side and not on the other. Now you use this construction twice:
\b\w\.\b\s
... and, rightly so, it does not match, because on the left side of the second \b you have a non-word character (a single full stop) and on the right side you also have a non-word character (a space).
Removing the \b between the full stop and \s is enough to make it work.
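To see this concretely, here is a small Python sketch that lists every position in the test string where \b matches. The variable names are illustrative only.

```python
import re

text = "J. F. Kennedy"

# \b matches an empty string, so each match's start() is a boundary position.
boundaries = [m.start() for m in re.finditer(r"\b", text)]
print(boundaries)
# [0, 1, 3, 4, 6, 13]

# Positions 2 and 5 (between each full stop and the following space) are
# missing: both neighbours are non-word characters, so \b cannot match there.
dot_then_boundary = re.search(r"\.\b\s", text)
print(dot_then_boundary)
# None
```

Positions 1 and 4 sit between a letter and its full stop, which is why \w\b\. works while \w\.\b does not.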
Upvotes: 3