Reputation: 559
I'm trying to get my head around regex and how to use it to split a string containing a paragraph into sentences.
Edit: If I have the text:
Hello, my name is Mr. Bob. I am 15.2 months old. Can you believe that? No... Oh well.
I want it to turn into
Hello, my name is Mr. Bob.
I am 15.2 months old.
Can you believe that?
No... Oh well.
Upvotes: 1
Views: 195
Reputation: 7227
So you need a `.` or `?` followed by one or more whitespace characters: `[.?]\s+`. Further, you do not want to split on multiple dots. For that I would use a negative lookbehind:
(?<!\.)[.?]\s+
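To see what this first pattern does and does not handle, here is a minimal sketch. The sample string and the variable names are made up for illustration; the replacement `\\g<0>\n` simply appends a newline after each match.

```python
import re

# Hypothetical sample text, chosen to contain a title and an ellipsis.
sample = "Hello, my name is Mr. Bob. No... Oh well."

# Split point: . or ? followed by whitespace, unless the previous character is a dot.
first_try = re.sub(r'(?<!\.)[.?]\s+', '\\g<0>\n', sample)

for line in first_try.split('\n'):
    print(line.strip())
```

The ellipsis in "No... Oh well." stays on one line, but the text is still broken after "Mr.".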
Then there is the problem of titles. You can include those in the negative lookbehind too. The caveat is that every alternative in a negative lookbehind must match the same number of characters, so we simply use a plain `.` to pad the lookbehind in the cases where we need it:
import re
print(re.sub(r'(?<!..\.|.Mr|Mrs|.Ms)[.?]\s+', '\\g<0>\n', s))
Hello, my name is Mr. Bob.
I am 15.2 months old.
Can you believe that?
No... Oh well.
Notice how we use `.` to pad our lookbehind to 3 characters: `\.` gets turned into `..\.`, `Mr` into `.Mr`, etc. We need 3 characters as that's the length of our longest alternative: `Mrs`. In the replacement, `\\g<0>` is expanded to the whole string matched by the pattern, not including the lookbehind.
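Putting this together, a minimal sketch that returns the sentences as a list rather than printing them. The helper name `split_sentences` and the constant `SENTENCE_END` are hypothetical; the pattern is the one from above.

```python
import re

# . or ? plus trailing whitespace, unless preceded by another dot or a title.
# Every lookbehind alternative is exactly 3 characters wide.
SENTENCE_END = re.compile(r'(?<!..\.|.Mr|Mrs|.Ms)[.?]\s+')

def split_sentences(text):  # hypothetical helper name
    # Append a newline after each sentence end, then split on those newlines.
    marked = SENTENCE_END.sub('\\g<0>\n', text)
    # strip() drops the whitespace that the match kept before the newline.
    return [line.strip() for line in marked.split('\n')]

s = "Hello, my name is Mr. Bob. I am 15.2 months old. Can you believe that? No... Oh well."
for sentence in split_sentences(s):
    print(sentence)
```

The `strip()` call is there because `\\g<0>` includes the whitespace matched by `\s+`, so each line would otherwise end in a trailing space.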
From here, it should be straightforward to extend the regex to suit your needs. One final point is that I'd remove all newline characters from the paragraph before running the regular expression. This may not be strictly necessary, but seems prudent as `.` does not match newlines by default.
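A small sketch of why this matters, using a made-up input where a title sits directly after a line break. The padding `.` in `.Mr` cannot match the newline, so the lookbehind no longer protects "Mr." until the whitespace is normalized. The variable names here are illustrative only.

```python
import re

pattern = re.compile(r'(?<!..\.|.Mr|Mrs|.Ms)[.?]\s+')

# Hypothetical input with a newline right before the title.
raw = "I met\nMr. Bob. He waved."

# Collapse newlines and runs of whitespace into single spaces.
flat = ' '.join(raw.split())

broken = pattern.sub('\\g<0>\n', raw)   # splits after "Mr." because "\nMr" does not match ".Mr"
fixed = pattern.sub('\\g<0>\n', flat)   # " Mr" matches ".Mr", so "Mr." is left alone

print(repr(broken))
print(repr(fixed))
```

Compiling the pattern with `re.DOTALL` would be another option, since that flag lets `.` match newlines as well.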
Upvotes: 2