diasks2

Reputation: 2142

Ruby regex to split text

I am using the regex below to split text at sentence-ending punctuation, but it doesn't handle quotes correctly.

text = "\"Hello my name is Kevin.\" How are you?"
text.scan(/\S.*?[...!!??]/)

=> ["\"Hello my name is Kevin.", "\" How are you?"]

My goal is to produce the following result, but I am not very good with regular expressions. Any help would be greatly appreciated.

=> ["\"Hello my name is Kevin.\"", "How are you?"]

Upvotes: 0

Views: 117

Answers (2)

Casimir et Hippolyte

Reputation: 89649

text.scan(/"(?>[^"\\]+|\\{2}|\\.)*"|\S.*?[...!!??]/)

The idea is to check for quoted parts first. The subpattern is a bit more elaborate than a simple "[^"]*" so that it can deal with escaped quotes (* see the end of this answer for a more efficient pattern).

pattern details:

"             # literal: a double quote
(?>           # open an atomic group: all that can be between quotes
    [^"\\]+   # all that is not a quote or a backslash
  |           # OR
    \\{2}     # 2 backslashes (the idea is to skip even numbers of backslashes)
  |           # OR
    \\.       # an escaped character (in particular a double quote)
)*            # repeat zero or more times the atomic group
"             # literal double quote
|             # OR
\S.*?[...!!??]
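
As a runnable sketch of the breakdown above, the same pattern can be written with Ruby's extended (x) flag so that the comments live inside the regex. The method name split_quote_first and the second sample string are made up for illustration, and the punctuation class is written as plain [.!?] on the assumption that this is what the class in the question amounts to.

```ruby
# Hypothetical helper: quoted parts are tried first; the original
# "non-space up to ending punctuation" branch is only the fallback.
def split_quote_first(text)
  pattern = /
    "                 # opening double quote
    (?>               # atomic group: what may appear between the quotes
        [^"\\]+       #   anything that is not a quote or a backslash
      | \\{2}         #   two backslashes
      | \\.           #   a backslash followed by any character
    )*
    "                 # closing double quote
    |
    \S.*?[.!?]        # fallback: shortest run ending in . ! or ?
  /x
  text.scan(pattern)
end

p split_quote_first("\"Hello my name is Kevin.\" How are you?")

# Single-quoted Ruby literal, so the backslashes stay in the string.
p split_quote_first('"She said \"hi.\" to me." Then she left.')
```

In the second call the escaped quotes stay inside the first element instead of ending it early.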

To deal with single quotes too, you can add '(?>[^'\\]+|\\{2}|\\.)*'| to the pattern (the most efficient option), but if you want to make it shorter you can write this:

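matches = []
text.scan(/(['"])(?>[^'"\\]+|\\{2}|\\.|(?!\1)["'])*\1|\S.*?[...!!??]/) { matches << $& }

where \1 is a backreference to the first capturing group (the quote that was found) and (?!\1) means "not followed by that same quote". Because this pattern contains a capturing group, scan without a block would return the captured quote rather than the whole match, so the block collects $& (the full match) into matches.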
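
A minimal sketch of the shorter variant, tried against both quote styles. String#scan yields the captured groups rather than the whole match when the pattern has a capturing group, so the helper gathers $& in a block. The method name split_any_quote and the sample strings are hypothetical, and [.!?] is assumed to be equivalent to the class used in the question.

```ruby
# Hypothetical helper: returns whole matches even though the pattern has a
# capturing group (plain scan would return the group contents instead).
def split_any_quote(text)
  pattern = /(['"])(?>[^'"\\]+|\\{2}|\\.|(?!\1)["'])*\1|\S.*?[.!?]/
  matches = []
  text.scan(pattern) { matches << $& } # $& is the full text of the last match
  matches
end

p split_any_quote("\"Hello my name is Kevin.\" How are you?")
p split_any_quote("'It is \"fine\".' Next one!")
```

In the second call the double quotes nested inside the single-quoted part are consumed by the (?!\1)["'] branch, so the single-quoted part stays in one piece.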

(*) Instead of writing "(?>[^"\\]+|\\{2}|\\.)*", you can use "[^"\\]*+(?:\\.[^"\\]*)*+", which is more efficient.
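
To check that the two quoted-string subpatterns pick out the same pieces, a quick sketch can run both over a few inputs. It relies on the possessive quantifier *+, which Ruby's regex engine supports. The helper names and sample strings are illustrative only, and [.!?] again stands in for the class used in the question. The sketch compares results, not speed.

```ruby
# Atomic-group form from the main pattern.
def split_atomic(text)
  text.scan(/"(?>[^"\\]+|\\{2}|\\.)*"|\S.*?[.!?]/)
end

# Unrolled form with possessive quantifiers.
def split_unrolled(text)
  text.scan(/"[^"\\]*+(?:\\.[^"\\]*)*+"|\S.*?[.!?]/)
end

samples = [
  "\"Hello my name is Kevin.\" How are you?",
  '"She said \"hi.\" to me." Then she left.', # escaped quotes inside the quote
  '"C:\\\\" is a path. Done!'                 # two backslashes before the closing quote
]

samples.each do |s|
  a = split_atomic(s)
  b = split_unrolled(s)
  puts "#{a.inspect} same=#{a == b}"
end
```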

Upvotes: 2

falsetru

Reputation: 369494

Add optional quote (["']?) to the pattern:

text.scan(/\S.*?[...!!??]["']?/)
# => ["\"Hello my name is Kevin.\"", "How are you?"]
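
One case worth checking with this shorter pattern is ending punctuation in the middle of a quotation. The sketch below compares it with a quote-aware pattern of the kind given in the other answer, using a made-up sentence. The helper names and the sample are hypothetical, and [.!?] is assumed to match what the class in the question is meant to cover.

```ruby
# Pattern from this answer: optional quote after the ending punctuation.
def split_optional_quote(text)
  text.scan(/\S.*?[.!?]["']?/)
end

# Quote-aware pattern: a whole double-quoted part is matched first.
def split_quoted_part_first(text)
  text.scan(/"(?>[^"\\]+|\\{2}|\\.)*"|\S.*?[.!?]/)
end

sample = "\"Wait. Stop!\" he said."
p split_optional_quote(sample)    # splits inside the quotation
p split_quoted_part_first(sample) # keeps the quotation together
```

For the text in the question both helpers return the same two elements, so the difference only matters if quotations can contain more than one sentence.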

Upvotes: 1
