Reputation: 2881
I am trying to come up with a regex which matches punctuation (!, ?, and .) followed by a space. I do NOT want to match periods which are preceded by salutations like "Mr.", "Mrs.", etc.
Doing the first part is simple enough:
r"[\?|!|\.] "
But I am struggling with the second part. Here is what I have so far:
r"(?<=[^(Mr|Ms)])\. "
The second one does NOT match something like "radar. " or "cups. " or "loom. " which is bad. I am also having trouble combining both those regexes into a single one.
Thanks.
Upvotes: 0
Views: 565
Reputation: 402413
This should work:
(?<!(Mr)|(Ms))(?<!(Mrs))[.!?](?=\s|$)
Here's a demo:
In [19]: re.search(r'(?<!(Mr)|(Ms))(?<!(Mrs))[.!?](?=\s|$)', 'Mrs. Jones!').group(0)
Out[19]: '!'
There's a negative lookbehind for Mr and Ms, a second one for Mrs, and a positive lookahead for either whitespace or the end of the string.
Please note that salutations of different lengths each need their own lookbehind.
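To illustrate why the lookbehinds have to be split by length, here is a small sketch. The sample sentence is made up, and the pattern drops the capturing groups for brevity. It shows that Python's re rejects a lookbehind whose alternatives differ in width, while one lookbehind per length compiles and skips the salutations:

```python
import re

# A lookbehind mixing 2- and 3-character alternatives is rejected by re.
try:
    re.compile(r'(?<!Mr|Mrs)\.')
    variable_width_ok = True
except re.error:
    variable_width_ok = False

# One fixed-width lookbehind per salutation length compiles fine.
pattern = re.compile(r'(?<!Mr|Ms)(?<!Mrs)[.!?](?=\s|$)')

text = 'Mr. Smith met Mrs. Jones. Really? Yes!'  # made-up sample
matches = [(m.start(), m.group(0)) for m in pattern.finditer(text)]
print(variable_width_ok, matches)
```

Only the three sentence-ending marks are reported; the periods after "Mr" and "Mrs" are skipped.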
Edited, as per OP's request:
In [78]: re.search(r'(?:(?<!(Mr)|(Ms))(?<!(Mrs))[.]|[!?])(?=\s|$)', 'Mrs! Jones').group(0)
Out[78]: '!'
Upvotes: 1
Reputation: 103814
If you want to be complete, you would need to exclude Prof, Dr, Miss, Mrs, Ms, Mr, etc.
Python's re module only allows fixed-width lookbehinds; therefore, you need a separate lookbehind for each width:
r'(?<!\bMr|\bMs|\bDr)(?<!\bMrs)(?<!\bProf|\bMiss)([.!?])(?= |\n|\Z)'
Or use the third-party regex module, which allows variable-width lookbehind assertions. Then you can do:
r'(?<!\bMr|\bMs|\bMrs|\bDr|\bMiss|\bProf)([.!?])(?= |\n|\Z)'
Side note: A character class always matches a single character. That is why you get unexpected results with [^(Mr|Ms)]: it is a negated character class over the individual characters M, r, s, |, ( and ), not a negation of the words Mr and Ms.
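A quick check of that side note, as a sketch with made-up sample strings: the negated class rejects any single preceding character from the set above, so a period is skipped whenever the letter before it happens to be one of them.

```python
import re

# The pattern from the question: the class excludes single characters,
# not the words "Mr" and "Ms".
broken = re.compile(r"(?<=[^(Mr|Ms)])\. ")

samples = ["radar. ", "cups. ", "Mr. ", "done. "]  # made-up samples
results = {s: bool(broken.search(s)) for s in samples}
print(results)
```

Here "radar. " and "cups. " are skipped for the same reason as "Mr. ": the period follows an r or an s. Only "done. " matches.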
Upvotes: 0
Reputation: 3955
Here is a working one: https://regex101.com/r/iRNTMY/2
(?<!(Mr|Ms))(?<!(Mrs))[.?!]
It uses negative look-behind twice for the two different length possibilities.
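If the list of salutations grows, writing one lookbehind per length by hand gets tedious. Below is a minimal sketch that groups titles by length and generates the lookbehinds. The helper name build_sentence_end_pattern, the title list and the sample text are all hypothetical, and the \b anchors and the trailing lookahead are borrowed from the other answers rather than from the pattern above.

```python
import re
from collections import defaultdict

def build_sentence_end_pattern(titles):
    """Compile a pattern for ., ! or ? not preceded by any of the given titles."""
    by_length = defaultdict(list)
    for title in titles:
        by_length[len(title)].append(re.escape(title))
    # Python's re needs fixed-width lookbehinds, so emit one per title length.
    lookbehinds = ''.join(
        r'(?<!\b(?:%s))' % '|'.join(group)
        for _, group in sorted(by_length.items())
    )
    return re.compile(lookbehinds + r'[.!?](?=\s|$)')

# Hypothetical usage with a made-up sentence.
pattern = build_sentence_end_pattern(['Mr', 'Ms', 'Mrs', 'Dr', 'Prof', 'Miss'])
text = 'Dr. Who met Prof. Plum. Why? Fun!'
found = [m.group(0) for m in pattern.finditer(text)]
print(found)
```

The periods after "Dr" and "Prof" are skipped, while the sentence-ending ".", "?" and "!" are found.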
Upvotes: 1