Reputation: 35
I'm currently running into trouble with what I thought would be a simple task.
If I have a string like:
Sentence 1 “double quoted phrase” sentence 1. Sentence 2? Sentence 3 (numbers in parentheses like 1.2 should not be split). Sentence 4 ' single quoted phrase. rest of quote' sentence 4. Sentence 5!
I want to split it into:
Sentence 1 “double quoted phrase” sentence 1.
Sentence 2?
Sentence 3 (numbers in parentheses like 1.2 should not be split).
Sentence 4 ' single quoted phrase. rest of quote' sentence 4.
Sentence 5!
Obviously a simple "\.|\?|!" match won't work. Any help is appreciated.
I realize regexes might not be the best tool for this, but unless there's another quick easy solution that I'm missing, I'm past the point of no return.
Upvotes: 0
Views: 176
Reputation: 195039
I am not sure this is a job for regex, but take a look at this one (with sed):
sed -r 's/([.?!]) +([A-Z])|\1$/\1\n\2/g' file
It outputs:
Sentence 1 “double quoted phrase” sentence 1.
Sentence 2?
Sentence 3 (numbers in parentheses like 1.2 should not be split).
Sentence 4 ' single quoted phrase. rest of quote' sentence 4.
Sentence 5!
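However, it is not perfect. If you change "rest" in sentence 4 to "Rest", it fails, because the period inside the quote is then followed by a space and a capital letter and looks like a sentence ending.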
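For anyone not using sed, here is a rough Python sketch of the same heuristic (split after ., ? or ! when spaces and a capital letter follow). The names BOUNDARY and split_naive are made up for illustration. It shows both the working case and the "Rest" failure:

```python
import re

# Boundary: one or more spaces that sit between sentence punctuation
# and an uppercase letter. Lookarounds keep the punctuation and the
# capital letter in the resulting pieces.
BOUNDARY = re.compile(r"(?<=[.?!]) +(?=[A-Z])")

def split_naive(text):
    """Split on the 'punctuation, spaces, capital' heuristic only."""
    return BOUNDARY.split(text)

text = (
    "Sentence 1 “double quoted phrase” sentence 1. Sentence 2? "
    "Sentence 3 (numbers in parentheses like 1.2 should not be split). "
    "Sentence 4 ' single quoted phrase. rest of quote' sentence 4. Sentence 5!"
)

for sentence in split_naive(text):
    print(sentence)

# Capitalising "rest" makes the period inside the quote look like a boundary.
broken = text.replace("rest of quote", "Rest of quote")
print(len(split_naive(text)), len(split_naive(broken)))  # 5 6
```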
The problem is that you have to check whether the ., ! or ? is wrapped in "", '', (), [], {} and so on. If it is, it does not end a sentence. The worse part is malformed input. For example, I could write a sentence:
The dot ". is a period.
Notice that I forgot (by mistake) the closing quote. Or take the following (two sentences):
Why not put a brace "(" there ? The closing brace ")" is missing its partner.
How can your program (by regex) know this should be two sentences?
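To make the wrapping check concrete, here is a minimal Python sketch that tracks open quotes and brackets with a stack and only splits when nothing is open. It is an illustration under assumptions, not a complete solution. The name split_balanced is made up, and the rule for straight single quotes is a guess: a quote opens after whitespace and closes when the next character is not a letter or digit, so a contraction such as "don't" is not mistaken for a quote. It handles the "Rest" case, but on the two-sentence example above it returns one sentence, which is exactly the point:

```python
PAIRS = {"(": ")", "[": "]", "{": "}", "“": "”"}
CLOSERS = set(PAIRS.values())

def split_balanced(text):
    """Split on . ? ! followed by spaces and a capital, but only at depth 0."""
    sentences, stack, start = [], [], 0
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch == '"':
            # Straight double quote: closes if it is the innermost open item.
            if stack and stack[-1] == ch:
                stack.pop()
            else:
                stack.append(ch)
        elif ch == "'":
            # Assumed heuristic to tell quotes from apostrophes.
            prev = text[i - 1] if i > 0 else " "
            nxt = text[i + 1] if i + 1 < n else " "
            if stack and stack[-1] == "'" and not nxt.isalnum():
                stack.pop()
            elif prev.isspace():
                stack.append("'")
        elif ch in PAIRS:
            stack.append(PAIRS[ch])  # remember which closer we expect
        elif ch in CLOSERS:
            if stack and stack[-1] == ch:
                stack.pop()
        elif ch in ".?!" and not stack:
            j = i + 1
            while j < n and text[j] == " ":
                j += 1
            if j > i + 1 and j < n and text[j].isupper():
                sentences.append(text[start:i + 1])
                start = i = j
                continue
        i += 1
    if text[start:]:
        sentences.append(text[start:])
    return sentences

ok = (
    "Sentence 1 “double quoted phrase” sentence 1. Sentence 2? "
    "Sentence 3 (numbers in parentheses like 1.2 should not be split). "
    "Sentence 4 ' single quoted phrase. Rest of quote' sentence 4. Sentence 5!"
)
tricky = 'Why not put a brace "(" there ? The closing brace ")" is missing its partner.'

print(len(split_balanced(ok)))      # 5
print(len(split_balanced(tricky)))  # 1, although a human reads two sentences
```

The quoted "(" is pushed as an open bracket, so everything up to the quoted ")" counts as wrapped and the ? is never considered a boundary.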
Upvotes: 1