Phil

Reputation: 35

Splitting a string up into sentences by common punctuation

I'm currently running into trouble with what I thought would be a simple task.

If I have a string like:

Sentence 1 “double quoted phrase” sentence 1. Sentence 2? Sentence 3 (numbers in parentheses like 1.2 should not be split). Sentence 4 ' single quoted phrase. rest of quote' sentence 4. Sentence 5!

I want to split it into:

Sentence 1 “double quoted phrase” sentence 1.

Sentence 2?

Sentence 3 (numbers in parentheses like 1.2 should not be split).

Sentence 4 ' single quoted phrase. rest of quote' sentence 4.

Sentence 5!

Obviously a simple "\.|\?|!" match won't work. Any help is appreciated.

I realize regexes might not be the best tool for this, but unless there's another quick easy solution that I'm missing, I'm past the point of no return.

Upvotes: 0

Views: 176

Answers (2)

VladL

Reputation: 13033

Try the following regex:

(?:^|\s).+?[.!?](?:\s|$)
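
For anyone who wants to try this, here is a small Python sketch (the question does not name a language, so Python and the names `TEXT`, `POSTED` and `VARIANT` are my assumptions). It runs the pattern as posted and, next to it, a lookahead variant that is my own adjustment and not part of this answer.

```python
import re

# Sample string from the question.
TEXT = (
    "Sentence 1 “double quoted phrase” sentence 1. Sentence 2? "
    "Sentence 3 (numbers in parentheses like 1.2 should not be split). "
    "Sentence 4 ' single quoted phrase. rest of quote' sentence 4. "
    "Sentence 5!"
)

# The pattern exactly as posted in the answer.
POSTED = re.compile(r"(?:^|\s).+?[.!?](?:\s|$)")

# Hypothetical variant (an assumption, not from the answer): start at the
# first non-space character and only look ahead at the trailing whitespace,
# so that whitespace is not consumed by the match.
VARIANT = re.compile(r"\S.*?[.!?](?=\s|$)")

posted_matches = POSTED.findall(TEXT)
variant_matches = VARIANT.findall(TEXT)

for piece in variant_matches:
    print(repr(piece))
```

With Python's `re`, the posted pattern consumes the space after each sentence, so the next match can only begin at a later space and pieces such as `' 2? '` come back. The variant keeps the leading words but still cuts sentence 4 at `phrase.`, so neither handles the quoted text.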

Upvotes: 1

Kent

Reputation: 195039

I am not sure this is a job for regex.

But take a look at this one (with sed):

 sed -r 's/([.?!]) +([A-Z])|\1$/\1\n\2/g' file

It outputs:

Sentence 1 “double quoted phrase” sentence 1.
Sentence 2?
Sentence 3 (numbers in parentheses like 1.2 should not be split).
Sentence 4 ' single quoted phrase. rest of quote' sentence 4.
Sentence 5!

However, it is not perfect. If you change "rest" in sentence 4 to "Rest", it fails.
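
If sed is not at hand, the same idea can be sketched in Python. This is a rough equivalent, not a port: it keeps only the first alternative of the sed expression and leaves out the `|\1$` part, and the names `split_on_capital`, `TEXT` and `CHANGED` are mine.

```python
import re

TEXT = (
    "Sentence 1 “double quoted phrase” sentence 1. Sentence 2? "
    "Sentence 3 (numbers in parentheses like 1.2 should not be split). "
    "Sentence 4 ' single quoted phrase. rest of quote' sentence 4. "
    "Sentence 5!"
)

def split_on_capital(text):
    """Break after . ? or ! when spaces and an uppercase ASCII letter follow."""
    # Replace the run of spaces with a newline, keeping the punctuation
    # (group 1) and the capital letter (group 2), then split on newlines.
    return re.sub(r"([.?!]) +([A-Z])", r"\1\n\2", text).split("\n")

# Same input with "rest" capitalised, the failing case mentioned above.
CHANGED = TEXT.replace("phrase. rest", "phrase. Rest")

print(len(split_on_capital(TEXT)))     # 5 pieces
print(len(split_on_capital(CHANGED)))  # 6 pieces: sentence 4 is cut in two
```

The capital-letter test is the only thing protecting the quoted phrase, which is why the second input splits sentence 4 after `phrase.`.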

The problem is that you have to check whether the . ! or ? is wrapped in "", '', (), [], {} and so on; if it is, it is not the end of a sentence. Worse, suppose I write a sentence like this:

The dot ". is a period.

Notice that I forgot the closing quote (a mistake). Or take the following (two sentences):

Why not put a brace "(" there ? The closing brace ")" is missing its partner.

How can your program (by regex) know this should be two sentences?
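
For input where the quotes and brackets do balance, the wrapping check described above can be sketched without regex as a single pass with a stack. This is only a sketch under my own assumptions: Python, the names `PAIRS` and `split_sentences`, and the rule that a `'` between two letters or digits is an apostrophe rather than a quote.

```python
# Opening delimiter -> the closing delimiter we then wait for.
# Straight quotes open and close with the same character.
PAIRS = {"(": ")", "[": "]", "{": "}", "“": "”", '"': '"', "'": "'"}

def split_sentences(text):
    """Split at . ! ? only when no quote or bracket is currently open."""
    sentences = []
    stack = []   # closing delimiters still expected, innermost last
    start = 0
    for i, ch in enumerate(text):
        # Assumption: a ' between two alphanumerics (as in "don't") is an
        # apostrophe, so it neither opens nor closes a quote.
        if (ch == "'" and 0 < i < len(text) - 1
                and text[i - 1].isalnum() and text[i + 1].isalnum()):
            continue
        if stack and ch == stack[-1]:
            stack.pop()                  # closes the innermost delimiter
        elif ch in PAIRS:
            stack.append(PAIRS[ch])      # opens a new one
        elif ch in ".!?" and not stack:
            # Only a terminator followed by whitespace or the end counts,
            # so "1.2" would not split even outside parentheses.
            if i + 1 == len(text) or text[i + 1].isspace():
                sentences.append(text[start:i + 1].strip())
                start = i + 1
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)           # leftover text with no terminator
    return sentences

sample = (
    "Sentence 1 “double quoted phrase” sentence 1. Sentence 2? "
    "Sentence 3 (numbers in parentheses like 1.2 should not be split). "
    "Sentence 4 ' single quoted phrase. rest of quote' sentence 4. "
    "Sentence 5!"
)
for s in split_sentences(sample):
    print(s)
```

On the sample from the question this yields the five sentences asked for. It does not answer the question above, though: the brace example comes back as one piece, because the quoted "(" is treated as an opening bracket that stays open until the ")" in the second sentence.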

Upvotes: 1
