user_3068807

Reputation: 407

Regular expressions with \b and non-word characters (like '.')

Why does this regular expression, r'^(?P<first_init>\b\w\.\b)\s(?P<mid_init>\b\w\.\b)\s(?P<last_name>\b\w+\b)$', not match J. F. Kennedy?

I have to remove the \b in the first_init and mid_init groups to get a match. I am using Python, and for testing I am using https://regex101.com/

Thanks

Upvotes: 1

Views: 1470

Answers (5)

Jan

Reputation: 43169

Just remove the second boundary:

^(?P<first_init>\b\w\.)\s
(?P<mid_init>\b\w\.)\s
(?P<last_name>\b\w+\b)$

And see a demo on regex101.com.
The reason is that the second \b sits between a dot and a space, so it fails. Remember that a word boundary needs a word character (i.e. one of a-zA-Z0-9_) on exactly one of its sides.
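As a quick check in Python, here is a minimal sketch of the pattern above written on one line (the variable names `pattern` and `match` are only for illustration):

```python
import re

# Same pattern as above, with the \b after each "\." removed.
pattern = re.compile(
    r"^(?P<first_init>\b\w\.)\s(?P<mid_init>\b\w\.)\s(?P<last_name>\b\w+\b)$"
)

match = pattern.match("J. F. Kennedy")
print(match.groupdict())
# {'first_init': 'J.', 'mid_init': 'F.', 'last_name': 'Kennedy'}
```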

Upvotes: 1

Cyb3rFly3r

Reputation: 1341

It does not match because of the \. (dot) character. The dot is not a word character, so there is no word boundary between the dot and the space that follows it (the regex definition of a word is probably not the one you have in mind). You can easily rewrite the pattern without any \b. Read the documentation carefully.
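One possible rewrite without \b, as a sketch only: since ^, \s and $ already pin down where each part starts and ends, the boundaries are arguably redundant here. The name `no_boundary` is just for illustration.

```python
import re

# No \b at all: the anchors and the \s separators delimit each group.
no_boundary = re.compile(
    r"^(?P<first_init>\w\.)\s(?P<mid_init>\w\.)\s(?P<last_name>\w+)$"
)

print(no_boundary.match("J. F. Kennedy") is not None)   # True
print(no_boundary.match("JJ. F. Kennedy") is not None)  # False
```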

Upvotes: 1

ForceMagic

Reputation: 506

\b means border of a word.

Word here is defined like so:

A word ends, when there is a space character following it.

"J.", "F." and "Kennedy" are the words here.

Your example is trying to search for a space between the letter and the dot, so it is searching for J . F . Kennedy.

Upvotes: 0

Justin O Barber

Reputation: 11591

\b matches the empty string only at the beginning or end of a word. A word is a sequence of alphanumeric or underscore characters. The dot (.) cannot be part of a word.

>>> import re
# does not match when \. is within word boundary
>>> re.match(r'^(?P<first_init>\b\w\.\b)\s(?P<mid_init>\b\w\.\b)\s(?P<last_name>\b\w+\b)$', 'J. F. Kennedy')
# matches when \b is moved to left of \.
>>> re.match(r'^(?P<first_init>\b\w\b\.)\s(?P<mid_init>\b\w\b\.)\s(?P<last_name>\b\w+\b)$', 'J. F. Kennedy')  # matches

The . is not part of the word in this sense. See the docs here.

Upvotes: 1

Jongware

Reputation: 22457

You are over-applying the \b word breaks.

\b will only match if on one side there is a valid "word" character and on the other side not. Now you use this construction twice:

\b\w\.\b\s

... and, rightly so, it does not match: the second \b has a non-word character on its left (a single full stop) and also a non-word character on its right (a space).

Removing the \b between the full stop and \s is enough to make it work.
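To see this for yourself, here is a small Python sketch that lists every position where \b matches in the string "J. F." (the names `text` and `positions` are made up for this example):

```python
import re

text = "J. F."
# Index:  J=0  .=1  space=2  F=3  .=4

# \b is zero-width, so each match is an empty string at some position.
positions = [m.start() for m in re.finditer(r"\b", text)]
print(positions)
# [0, 1, 3, 4]
```

Position 2 (between the full stop and the space) is missing from the list, and that is exactly where the original pattern demands a \b.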

Upvotes: 3
