deruse

Reputation: 2881

Regex to match punctuation followed by space with some exceptions

I am trying to come up with a regex which matches punctuation (!, ?, and .) followed by a space. I do NOT want to match periods that follow salutations such as "Mr.", "Mrs.", etc.

Doing the first part is simple enough: r"[\?|!|\.] "

But I am struggling with the second part. Here is what I have so far: r"(?<=[^(Mr|Ms)])\. "

The second one does NOT match something like "radar. " or "cups. " or "loom. " which is bad. I am also having trouble combining both those regexes into a single one.

Thanks.

Upvotes: 0

Views: 565

Answers (3)

cs95

Reputation: 402413

This should work:

(?<!(Mr)|(Ms))(?<!(Mrs))[.!?](?=\s|$)

Here's a demo:

In [19]: re.search(r'(?<!(Mr)|(Ms))(?<!(Mrs))[.!?](?=\s|$)', 'Mrs. Jones!').group(0)
Out[19]: '!'

There's a negative lookbehind for Mr, Ms, and Mrs, and a positive lookahead for either whitespace or the end of the string.

Please note that salutations of different lengths each need their own lookbehind.
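
As a quick check of the idea above, here is a minimal sketch that runs the same lookbehind/lookahead combination over a longer sentence. The capture groups are dropped so that findall returns the matched punctuation itself; the sample text is made up for illustration.

```python
import re

# Same structure as the pattern above: one lookbehind per salutation length,
# then the punctuation, then a lookahead for whitespace or end of string.
PATTERN = re.compile(r"(?<!Mr|Ms)(?<!Mrs)[.!?](?=\s|$)")

# Hypothetical sample text, not from the question.
text = "Mr. Smith met Mrs. Jones. Did Ms. Lee come? Yes!"

matches = PATTERN.findall(text)
print(matches)  # ['.', '?', '!']
```

The periods after Mr, Mrs, and Ms are skipped; the three sentence-ending marks are kept.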


Edited, as per OP's request:

In [78]: re.search(r'(?:(?<!(Mr)|(Ms))(?<!(Mrs))[.]|[!?])(?=\s|$)', 'Mrs! Jones').group(0)
Out[78]: '!'
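
To make the difference between the two variants concrete, here is a small sketch, assuming the intent of the edit is that only periods should be guarded by the salutation lookbehinds. The names guard_all and guard_period and the sample text are illustrative. Note the non-capturing group around the alternation: without it, the lookahead would bind only to the [!?] branch.

```python
import re

# Variant 1: the lookbehinds guard every punctuation mark.
guard_all = re.compile(r"(?<!Mr|Ms)(?<!Mrs)[.!?](?=\s|$)")

# Variant 2: the lookbehinds guard only the period. The (?:...) group makes
# the trailing lookahead apply to both branches of the alternation.
guard_period = re.compile(r"(?:(?<!Mr|Ms)(?<!Mrs)\.|[!?])(?=\s|$)")

# Hypothetical sample text.
text = "Mrs! Jones met Mr. Smith."

all_marks = guard_all.findall(text)
period_only = guard_period.findall(text)
print(all_marks)    # ['.']
print(period_only)  # ['!', '.']
```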

Upvotes: 1

dawg

Reputation: 103814

If you want to be complete, you would need to exclude Prof, Dr, Miss, Mrs, Ms, Mr, etc.

Python's re module only allows fixed-width lookbehinds; therefore, you need a separate lookbehind for each width:

r'(?<!\bMr|\bMs|\bDr)(?<!\bMrs)(?<!\bProf|\bMiss)([.,;])(?= |\n|\Z)'
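
Writing one lookbehind per width by hand gets tedious as the list of titles grows. The sketch below shows one way to generate them; build_pattern is a hypothetical helper, not part of re, and it assumes sentence punctuation of ".!?" by default. It also shows the error re raises when a single lookbehind mixes widths.

```python
import re
from itertools import groupby

def build_pattern(titles, punctuation=".!?"):
    """Hypothetical helper: one negative lookbehind per title length."""
    ordered = sorted(set(titles), key=lambda t: (len(t), t))
    lookbehinds = []
    for _, group in groupby(ordered, key=len):
        # Every alternative in this group has the same width; \b is zero-width.
        alternatives = "|".join(r"\b" + re.escape(t) for t in group)
        lookbehinds.append(f"(?<!{alternatives})")
    char_class = "[" + re.escape(punctuation) + "]"
    return re.compile("".join(lookbehinds) + char_class + r"(?=\s|\Z)")

# Mixing widths inside one lookbehind is rejected by re at compile time.
variable_width_error = None
try:
    re.compile(r"(?<!Mr|Mrs)\.")
except re.error as exc:
    variable_width_error = str(exc)

pattern = build_pattern(["Mr", "Ms", "Dr", "Mrs", "Prof", "Miss"])
found = pattern.findall("Dr. Who met Prof. X. Really?")
print(found)  # ['.', '?']
```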


Or use the third-party regex module, which allows variable-width lookbehind assertions. Then you can do:

r'(?<!\bMr|\bMs|\bMrs|\bDr|\bMiss|\bProf)([.,;])(?= |\n|\Z)'



Side note: A character class always matches exactly one character. That is why you get unexpected results with [^(Mr|Ms)]. It is a negated character class that matches any single character other than M, r, s, |, ( or ).
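
The effect is easy to see with a few test strings. This is a minimal sketch using the lookbehind from the question; the test words other than "radar. " and "cups. " are made up for illustration.

```python
import re

# The attempt from the question: the bracketed part is a character class,
# so it only inspects the single character right before the period.
attempt = re.compile(r"(?<=[^(Mr|Ms)])\. ")

words = ["radar. ", "cups. ", "Mr. ", "loop. ", "a|. "]
results = {word: bool(attempt.search(word)) for word in words}
print(results)
```

Only "loop. " matches: every other string has M, r, s, |, ( or ) immediately before the period, so the lookbehind fails there.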


Upvotes: 0

MotKohn

Reputation: 3955

Here is a working one: https://regex101.com/r/iRNTMY/2

(?<!(Mr|Ms))(?<!(Mrs))[.?!]

It uses two negative lookbehinds, one for each salutation length.

Upvotes: 1
