wfgeo

Reputation: 3118

Regex match characters when not preceded by a string

I am trying to match spaces just after punctuation marks so that I can split up a large corpus of text, but I am seeing some common edge cases with places, titles and common abbreviations:

I am from New York, N.Y. and I would like to say hello! How are you today? I am well. I owe you $6. 00 because you bought me a No. 3 burger. -Sgt. Smith

I am using this with the re.split function in Python 3. I want to get this:

["I am from New York, N.Y. and I would like to say hello!",
"How are you today?",
"I am well.",
"I owe you $6. 00 because you bought me a No. 3 burger.",
"-Sgt. Smith"]

This is currently my regex:

(?<=[\.\?\!])(?<=[^A-Z].)(?<=[^0-9].)(?<=[^N]..)(?<=[^o].)

I decided to try to fix the No. first, with the last two conditions. But it relies on matching the N and the o independently, which I think is going to cause false positives elsewhere. I cannot figure out how to get it to match just the string No behind the period. I will then use a similar approach for Sgt. and any other "problem" strings I come across.

I am trying to use something like:

(?<=[\.\?\!])(?<=[^A-Z].)(?<=[^0-9].)^(?<=^No$)

But it doesn't capture anything after that. How can I get it to exclude certain strings which I expect to have a period in them, and not capture them?

Here is a regexr of my situation: https://regexr.com/4sgcb

Upvotes: 3

Views: 1680

Answers (4)

Benjamin Basmaci

Reputation: 2567

As mentioned in my comment above, if you are not able to define a fixed set of edge cases, this might not be possible without false positives or false negatives. Again, without context you cannot distinguish between abbreviations like "-Sgt. Smith" and sentence endings like "Sergeant is often abbreviated as Sgt. This makes it shorter."

However, if you can define a fixed set of edge cases, it's probably easier and much more readable to do this in multiple steps.


1. Identify your edge cases

For example, you can distinguish "I'll have a No. 3" from "No. I am your father" by checking for a subsequent number. So you would identify that edge case with a regex like this: No\. \d (Again, context matters. Sentences like "Is 200 enough? No. 200 is not enough." will still give you a false positive.)

2. Mask your edge cases

For each edge case, replace the string with a placeholder that is guaranteed not to appear in the original text. E.g. "No." => "======NUMBER======"

3. Run your algorithm

Now that you got rid of your unwanted punctuations, you can run a simpler regex like this to identify the true positives: [\.\!\?]\s

4. Unmask your edge cases

Turn "======NUMBER======" back into "No."
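The four steps above can be sketched in Python roughly as follows. This is only a minimal illustration under a few assumptions: the function name split_sentences is made up for the example, "No." followed by a digit is the only edge case handled, and the split uses a lookbehind rather than the plain [\.\!\?]\s pattern so that re.split keeps the punctuation on each sentence.

```python
import re

# Placeholder from step 2; it only has to be a string that never occurs in the text.
MASK = "======NUMBER======"

def split_sentences(text):
    """Hypothetical helper: mask, split, unmask."""
    # Steps 1 and 2: mask "No." only when a space and a digit follow it.
    masked = re.sub(r"No\.(?= \d)", MASK, text)
    # Step 3: split on whitespace that follows sentence-ending punctuation.
    parts = re.split(r"(?<=[.!?])\s+", masked)
    # Step 4: turn the placeholder back into "No."
    return [part.replace(MASK, "No.") for part in parts]

print(split_sentences("I am well. You bought me a No. 3 burger. Thanks!"))
# prints ['I am well.', 'You bought me a No. 3 burger.', 'Thanks!']
```

Further edge cases would each need their own masking rule and placeholder, and the false positive mentioned in step 1 ("No. 200 is not enough.") is not solved by this sketch.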

Upvotes: 2

Wiktor Stribiżew

Reputation: 627607

You may consider a matching approach. It gives you better control over the entities you want to count as single words rather than as sentence break signals.

Use a pattern like

\s*((?:\d+\.\s*\d+|(?:No|M[rs]|[JD]r|S(?:r|gt))\.|\.(?!\s+-?[A-Z0-9])|[^.!?])+(?:[.?!]|$))

See the regex demo

It is very similar to what I posted here, but it contains a pattern to match poorly formatted float numbers, added No. and Sgt. abbreviation support and a better handling of strings not ending with final sentence punctuation.

Python demo:

import re
p = re.compile(r'\s*((?:\d+\.\s*\d+|(?:No|M[rs]|[JD]r|S(?:r|gt))\.|\.(?!\s+-?[A-Z0-9])|[^.!?])+(?:[.?!]|$))')
s = "I am from New York, N.Y. and I would like to say hello! How are you today? I am well. I owe you $6. 00 because you bought me a No. 3 burger. -Sgt. Smith"
for m in p.findall(s):
    print(m)

Output:

I am from New York, N.Y. and I would like to say hello!
How are you today?
I am well.
I owe you $6. 00 because you bought me a No. 3 burger.
-Sgt. Smith

Pattern details

  • \s* - matches 0 or more whitespace (used to trim the results)
  • (?:\d+\.\s*\d+|(?:No|M[rs]|[JD]r|S(?:r|gt))\.|\.(?!\s+-?[A-Z0-9])|[^.!?])+ - one or more occurrences of several alternatives:
    • \d+\.\s*\d+ - 1+ digits, ., 0+ whitespaces, 1+ digits
    • (?:No|M[rs]|[JD]r|S(?:r|gt))\. - abbreviated strings like No., Mr., Ms., Jr., Dr., Sr., Sgt.
    • \.(?!\s+-?[A-Z0-9]) - matches a dot not followed by 1 or more whitespace characters, an optional -, and then an uppercase letter or a digit
    • | - or
    • [^.!?] - any character other than ., ! or ?
  • (?:[.?!]|$) - a ., ! or ?, or the end of the string.
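If the list of abbreviations is likely to grow, the alternation can be generated instead of hand-written. The sketch below is only one possible way to do that: the helper name build_sentence_pattern and the extra "Prof" entry are assumptions added for illustration, while the rest of the pattern is the one explained above.

```python
import re

def build_sentence_pattern(abbreviations):
    """Hypothetical helper: same pattern as above, with a generated abbreviation list."""
    # Escape each entry so that it is matched literally.
    abbr = "|".join(re.escape(a) for a in abbreviations)
    return re.compile(
        r"\s*((?:\d+\.\s*\d+"          # poorly formatted float such as "6. 00"
        r"|(?:" + abbr + r")\."        # a listed abbreviation plus its dot
        r"|\.(?!\s+-?[A-Z0-9])"        # a dot that does not look like a sentence end
        r"|[^.!?])+"                   # any other character
        r"(?:[.?!]|$))"                # final punctuation or end of string
    )

# "Prof" is an illustrative addition; the other entries come from the pattern above.
pattern = build_sentence_pattern(["No", "Mr", "Ms", "Jr", "Dr", "Sr", "Sgt", "Prof"])
print(pattern.findall("Ask Prof. Smith about it. He knows."))
# prints ['Ask Prof. Smith about it.', 'He knows.']
```

The list should not be empty, since an empty alternation would let every dot count as an abbreviation dot.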

Upvotes: 2

Enlico

Reputation: 28520

This is the closest regex I could get (the trailing space is the one we match):

(?<=(?<!(No|\.\w))[\.\?\!])(?! *\d+ *) 

which will also split after Sgt., for the simple reason that a lookbehind assertion has to be fixed width in Python (what a limitation!).
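The fixed-width restriction can be seen directly in Python, together with one common workaround: chaining several lookbehinds, each of which has a fixed width on its own. The snippet below is only a sketch. The sample sentence and variable names are made up, and it handles just No. and Sgt., not the digit and N.Y. cases from the question.

```python
import re

# Alternatives of different widths inside a single lookbehind are rejected by `re`.
try:
    re.compile(r"(?<!No|Sgt)\.")
    compiled = True
except re.error:
    compiled = False

# Each lookbehind below has a fixed width of its own, so chaining them is accepted.
splitter = re.compile(r"(?<=[.?!])(?<!No\.)(?<!Sgt\.)\s+")
parts = splitter.split("Order a No. 3 burger. Ask Sgt. Smith. Thanks!")

print(compiled)  # prints False
print(parts)     # prints ['Order a No. 3 burger.', 'Ask Sgt. Smith.', 'Thanks!']
```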

This is how I would do it in vim, which has no such limitation (the trailing space is the one we match):

\(\(No\|Sgt\|\.\w\)\@<![?.!]\)\( *\d\+ *\)\@!\zs 

For the OP as well as the casual reader, this question and the answers to it are about lookarounds and are very interesting.

Upvotes: 2

Andrej Kesely

Reputation: 195653

Doing it with only one regex will be tricky - as stated in comments, there are lots of edge cases.

I would do it in three steps:

  1. Replace spaces that should stay with some special character (re.sub)
  2. Split the text (re.split)
  3. Replace the special character with space

For example:

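import re
from pprint import pprint

# Placeholder for the spaces that must survive the split
zero_width_space = '\u200B'

s = 'I am from New York, N.Y. and I would like to say hello! How are you today? I am well. I owe you $6. 00 because you bought me a No. 3 burger. -Sgt. Smith'

# 1. Protect spaces after a dot followed by a digit or lowercase letter, after a comma, and after "Sgt."
s = re.sub(r'(?<=\.)\s+(?=[\da-z])|(?<=,)\s+|(?<=Sgt\.)\s+', zero_width_space, s)
# 2. Split on the remaining whitespace that follows sentence-ending punctuation
s = re.split(r'(?<=[.?!])\s+', s)

# 3. Restore the protected spaces
pprint([line.replace(zero_width_space, ' ') for line in s])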

Prints:

['I am from New York, N.Y. and I would like to say hello!',
 'How are you today?',
 'I am well.',
 'I owe you $6. 00 because you bought me a No. 3 burger.',
 '-Sgt. Smith']

Upvotes: 1
