fedorqui

Reputation: 289735

Optional dot in regex

Say I want to replace all the matches of Mr. and Mr with Mister.

I am using the following regex: \bMr(\.)?\b to match either Mr. or just Mr. Then, I use the re.sub() method to do the replacement.

What puzzles me is that Mr. gets replaced with Mister. (dot included). Why is the dot kept at the end? It looks like the pattern is matching only Mr, not the Mr\. case.

import re
s="a rMr. Nobody Mr. Nobody is Mr Nobody and Mra Nobody."
re.sub(r"\bMr(\.)?\b","Mister", s)

Returns:

'a rMr. Nobody Mister. Nobody is Mister Nobody and Mra Nobody.'

I also tried with the following, but also without luck:

re.sub(r"\b(Mr\.|Mr)\b","Mister", s)

My desired output is:

'a rMr. Nobody Mister Nobody is Mister Nobody and Mra Nobody.'
                     ^                              ^
                     no dot            this should be kept as it is

Upvotes: 20

Views: 3949

Answers (6)

jonrsharpe

Reputation: 122032

I think you want to match 'Mr' followed by either a '.' or a word boundary:

r"\bMr(?:\.|\b)"

In use:

>>> import re
>>> re.sub(r"\bMr(?:\.|\b)", "Mister", "a rMr. Nobody Mr. Nobody is Mr Nobody and Mra Nobody.")
'a rMr. Nobody Mister Nobody is Mister Nobody and Mra Nobody.'

Upvotes: 16

Donagh McCarthy

Reputation: 141

re.sub(r'\bMr[\s\.]', 'Mister ', s)

If this was Code Golf would I win?

Upvotes: 3

twasbrillig

Reputation: 18841

I think that in the original post the \b was the cause of some of the confusion.

From regex101:

\b matches, without consuming any characters, immediately between a character matched by \w and a character not matched by \w (in either order).

and

\w matches any letter, number or underscore.

The OP expected the \b to match the boundary between the dot and the space following it. It did not, because neither the dot nor the space is matched by \w, so there is no word boundary between them. Instead, the optional dot was skipped and the \b matched between the "r" of "Mr" and the dot. That left the dot out of the match, which is what the OP was asking about.
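A quick way to check this explanation is to list what the original pattern actually matches in the OP's string. The sketch below uses only the standard re module; the variable names are just for illustration.

```python
import re

s = "a rMr. Nobody Mr. Nobody is Mr Nobody and Mra Nobody."
pattern = re.compile(r"\bMr(\.)?\b")

# Text of every match: the dot never makes it into a match.
matches = [m.group(0) for m in pattern.finditer(s)]

# Content of the optional group (\.)? for every match:
# None means the group did not take part in the match.
dot_groups = [m.group(1) for m in pattern.finditer(s)]

print(matches)
print(dot_groups)
```

In "Mr. Nobody" the engine first tries to include the dot, finds no word boundary after it, backtracks, and settles for "Mr" alone.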

Upvotes: 7

Irshad Bhat

Reputation: 8709

>>> s="a rMr. Nobody Mr. Nobody is Mr Nobody and Mra Nobody."
>>> re.sub(r'\b(Mr[\.\s]\s*)',r'Mister ',s)
'a rMr. Nobody Mister Nobody is Mister Nobody and Mra Nobody.'

Upvotes: 3

vks

Reputation: 67968

re.sub(r"\bMr\.|\bMr\b","Mister", s)

Try this. You need to remove the \b after the dot.

Output: 'a rMr. Nobody Mister Nobody is Mister Nobody and Mra Nobody.'

The reason \bMr(\.)?\b does not work is that there is no word boundary between the . and the space.

There are three different positions that qualify as word boundaries:

  • Before the first character in the string, if the first character is a word character.
  • After the last character in the string, if the last character is a word character.
  • Between two characters in the string, where one is a word character and the other is not a word character.
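These positions can be listed directly by searching for \b on its own, since it matches an empty string at each boundary. A minimal sketch, where boundary_positions is a hypothetical helper name and "Mr. Nobody" is just a sample input:

```python
import re

def boundary_positions(text):
    """Return the index of every position in text where \\b matches."""
    return [m.start() for m in re.finditer(r"\b", text)]

text = "Mr. Nobody"
positions = boundary_positions(text)
print(positions)

# Index 3 is the spot between "." and " ": both are non-word
# characters, so it is not a word boundary.
print(3 in positions)
```

The boundaries fall before "M", between "r" and ".", before "N", and after the final "y", but not between the dot and the space.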

Upvotes: 7

twasbrillig

Reputation: 18841

@jonrsharpe's answer works, but this one is a bit simpler: \bMr(\.|\b)

http://regex101.com/r/sC9nG6/2
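For re.sub() with a fixed replacement string, the capturing and non-capturing variants should behave the same, since the replacement never refers to the group. A small sketch comparing the two on the OP's string (variable names are illustrative):

```python
import re

s = "a rMr. Nobody Mr. Nobody is Mr Nobody and Mra Nobody."

# Non-capturing group, as in the accepted-style answer.
with_non_capturing = re.sub(r"\bMr(?:\.|\b)", "Mister", s)

# Capturing group, as suggested here.
with_capturing = re.sub(r"\bMr(\.|\b)", "Mister", s)

print(with_non_capturing)
print(with_capturing == with_non_capturing)
```

The difference would only matter if the code later inspected match groups, for example with re.findall(), which returns group contents instead of the whole match when the pattern has a capturing group.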

Upvotes: 0
