Reputation: 17981
I want to split some text into sentences using a regular expression (in Ruby). It does not need to be perfectly accurate, so cases such as "Washington D.C." can be ignored.
However, I have a requirement: if part of the text is quoted (in single or double quotes), the punctuation inside the quotes should not end the sentence.
Say I have the following text:
Sentence One. "Wow." said Alice. Sentence Three.
It should be split into three sentences:
Sentence One.
"Wow." said Alice.
Sentence Three.
Currently I have content.scan(/[^\.!\?\n]*[\.!\?\n]/), but I have a problem with quotes.
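To illustrate the problem, here is a small sketch of what that pattern returns for the sample text above. The variable names naive and pieces are only for illustration:

```ruby
# The pattern currently in use: a run of non-terminators followed by one terminator.
naive = /[^\.!\?\n]*[\.!\?\n]/

text = 'Sentence One. "Wow." said Alice. Sentence Three.'

pieces = text.scan(naive)
p pieces
# => ["Sentence One.", " \"Wow.", "\" said Alice.", " Sentence Three."]
```

The period inside the quotes ends a match, so the quoted sentence is cut in two.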
UPDATE:
The current answer can run into performance issues. Try the following:
'Alice stood besides the table. She looked towards the rabbit, "Wait! Stop!", said Alice'.scan(regexp)
It would be nice if someone could figure out how to avoid that. Thanks!
Upvotes: 4
Views: 308
Reputation: 336308
How about this:
result = subject.scan(
  /(?:            # Either match...
     "[^"]*"      # a quoted sentence
   |              # or
     [^".!?]*     # anything except quotes or punctuation.
   )++            # Repeat as needed; avoid backtracking
   [.!?\s]*       # Then match optional punctuation characters or whitespace.
  /x)
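For what it's worth, here is a minimal sketch of how the pattern above might be applied to the sample text from the question. The constant name SENTENCE and the strip/reject cleanup are my own assumptions, not part of the question. The cleanup is there because scan also returns an empty match at the end of the string, and each match keeps its trailing whitespace. Note that only double quotes are handled.

```ruby
# SENTENCE is a hypothetical name; the pattern is the one shown above.
SENTENCE = /
  (?:            # Either match...
    "[^"]*"      # a double-quoted span
  |              # or
    [^".!?]*     # anything except quotes or sentence punctuation.
  )++            # Repeat as needed; possessive, so no backtracking
  [.!?\s]*       # Then optional punctuation and whitespace.
/x

text = 'Sentence One. "Wow." said Alice. Sentence Three.'

# Trim trailing whitespace from each match and drop the empty match at the end.
sentences = text.scan(SENTENCE).map(&:strip).reject(&:empty?)
puts sentences
# Sentence One.
# "Wow." said Alice.
# Sentence Three.
```

The same call on the longer string from the question's update should return two sentences, with the quoted "Wait! Stop!" kept inside the second one.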
Upvotes: 9