lulalala

Reputation: 17981

Split text into sentences, but skip quoted content

I want to split some text into sentences using a regular expression (in Ruby). It does not need to be accurate, so cases such as "Washington D.C." can be ignored.

However, I have a requirement: if part of a sentence is quoted (in single or double quotes), the punctuation inside the quotes should be ignored when splitting.

Say I have the following text:

Sentence One. "Wow." said Alice. Sentence Three.

It should be split into three sentences:

Sentence One.
"Wow." said Alice.
Sentence Three.

Currently I have content.scan(/[^\.!\?\n]*[\.!\?\n]/), but it has a problem with quotes.
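To make the problem concrete, here is a minimal sketch that runs the pattern from the question against the sample text. The helper name naive_split is only an illustration and is not part of the original code.

```ruby
# Hypothetical helper wrapping the pattern from the question.
def naive_split(text)
  text.scan(/[^\.!\?\n]*[\.!\?\n]/)
end

pieces = naive_split('Sentence One. "Wow." said Alice. Sentence Three.')

# Print each piece with inspect so leading spaces and quotes stay visible.
pieces.each { |piece| puts piece.inspect }
```

The period inside "Wow." ends a match, so the quoted sentence is cut in two and the text comes back as four pieces instead of three.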

UPDATE:

The current answer can run into a performance issue. Try the following:

'Alice stood besides the table. She looked towards the rabbit, "Wait! Stop!", said Alice'.scan(regexp)

It would be nice if someone could figure out how to avoid it. Thanks!

Upvotes: 4

Views: 308

Answers (1)

Tim Pietzcker

Reputation: 336308

How about this:

result = subject.scan(
    /(?:      # Either match...
     "[^"]*"  # a quoted sentence
    |         # or
     [^".!?]* # anything except quotes or punctuation.
    )++       # Repeat as needed; avoid backtracking
    [.!?\s]*  # Then match optional punctuation characters and whitespace.
    /x)
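As a rough usage sketch, the pattern can be wrapped in a small helper. The names SENTENCE_PATTERN and split_sentences are hypothetical, and the strip and reject steps are an assumption about the desired output: the pattern can also match the empty string, so scan may return an empty trailing entry.

```ruby
# Same structure as the pattern above. Comments inside the literal must not
# contain a "/" character, because Ruby would treat it as the end of the regex.
SENTENCE_PATTERN = /
  (?:
    "[^"]*"     # a double-quoted span, kept whole
  |
    [^".!?]*    # a run of characters that are neither quotes nor terminators
  )++           # possessive repeat, so the engine never retries this group
  [.!?\s]*      # trailing punctuation and whitespace
/x

# Hypothetical helper: trims each match and drops empty ones.
def split_sentences(text)
  text.scan(SENTENCE_PATTERN).map(&:strip).reject(&:empty?)
end

puts split_sentences('Sentence One. "Wow." said Alice. Sentence Three.')
```

This sketch only treats double quotes as quoting characters. Single quotes would need their own alternative, and apostrophes make that case ambiguous.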

Upvotes: 9
