Reputation: 7032

Regular Expression to match all characters up to next match

I'm parsing text that is many repetitions of a simple pattern. The text is in the format of a script for a play, like this:

SAMPSON
I mean, an we be in choler, we'll draw.

GREGORY
Ay, while you live, draw your neck out o' the collar.

I'm currently using the pattern ([A-Z0-9\s]+)\s*\:?\s*[\r\n](.+)[\r\n]{2}, which works fine (explanation below) except when the character's speech contains line breaks. When that happens, the character's name is captured successfully but only the first line of the speech is captured.

Turning on Single-line mode (to include line breaks in .) just creates one giant match.

How can I tell the (.+) to stop when it finds the next character name and end the match?
I'm iterating over each match individually (JavaScript), so the name must be available to the next match.

Ideally, I would be able to match all characters until the entire pattern is repeated.

Pattern explained:

The first group matches a character's name (capital letters, numbers, and whitespace allowed), optionally followed by a colon and whitespace.
The second group (character's speech) begins on a new line and captures any characters (except, problematically, line breaks and characters after them).
The pattern ends (and starts over) after a blank line.

Upvotes: 5

Answers (3)

Joanna Derks

Reputation: 4063

I finally managed to get it to match only what you wanted, i.e.
- the name of the character, allowing for whitespace and the colon
- the text associated with that person, optionally spanning multiple lines

You would need to do a find-all (global) search using this regex; it is case-sensitive:

((?:[A-Z]{2,}\s*:?\s*)+)\s+((?![A-Z]{2,}\s*:?\s*).+?[.?!]\s*)+

Explanation:

((?:[A-Z]{2,}\s*:?\s*)+) - the first group captures the upper case name of the person - it will match 'GREGOR' as well as 'MANFRED THE GREATEST:'
\s+ - at least one whitespace character
Then repeat at least once:
(?![A-Z]{2,}\s*:?\s*) - look ahead to check that the next text is not the upper case character name
.+?[.?!]\s* - match everything up to a character that ends a sentence [.?!], plus any optional whitespace after it
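
One thing to watch when using this from JavaScript: because the second group sits inside a repetition, it only keeps the last sentence it matched. Below is a minimal sketch of one way around that, assuming the repetition is wrapped in an extra outer group (a change to the pattern above, flagged in the comments) and the matches are iterated with String.prototype.matchAll. The sample text is adapted from the question, with a second line added to GREGORY's speech for illustration.

```javascript
// Pattern from the answer, with one assumed change: the repeated sentence
// group is made non-capturing and wrapped in an outer group, so group 2
// holds the whole speech instead of only the last sentence matched.
const speechPattern =
  /((?:[A-Z]{2,}\s*:?\s*)+)\s+((?:(?![A-Z]{2,}\s*:?\s*).+?[.?!]\s*)+)/g;

// Sample input based on the question. Every line ends a sentence,
// which this pattern relies on.
const script = [
  "SAMPSON",
  "I mean, an we be in choler, we'll draw.",
  "",
  "GREGORY",
  "Ay, while you live, draw your neck out o' the collar.",
  "But thou art not quickly moved to strike.",
].join("\n");

const speeches = [];
for (const match of script.matchAll(speechPattern)) {
  // trim() drops any line breaks captured around the name and the speech
  speeches.push({ name: match[1].trim(), speech: match[2].trim() });
}

console.log(speeches);
```

Note that a line which does not end in ., ? or ! would still break the match, since . does not cross line breaks here.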

Upvotes: 0

Chris Pitman

Reputation: 13104

Consider going in a different direction with this. You really want to split a larger dialogue on any line that contains a name. You can still do this with a regular expression (replace the regex with whatever will match the "speaker" line; the m flag is needed so that ^ and $ match at line boundaries rather than only at the start and end of the whole string):

results = "Insert script here".split(/^([A-Z]+)$/m)

On a standards-compliant implementation, your example text will end up in an array like so (each speech keeps the line breaks around it):

results[0] = ""
results[1] = "SAMPSON"
results[2] = "\nI mean, an we be in choler, we'll draw.\n\n"
results[3] = "GREGORY"
results[4] = "\nAy, while you live, draw your neck out o' the collar."

A caveat is that most browsers are spotty on the standard here. You can use the XRegExp library to get consistent cross-browser behaviour.
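
To turn that flat array into name/speech pairs, something like the following sketch could work. The speaker-line regex (a whole line of capitals, digits and spaces, with an optional trailing colon) and the sample text are assumptions for illustration, not part of the original suggestion.

```javascript
// Assumed speaker-line pattern: a whole line of capitals, digits and
// spaces, with an optional trailing colon. The m flag makes ^ and $
// match at the start and end of each line.
const speakerLine = /^([A-Z][A-Z0-9 ]*):?$/m;

const script = [
  "SAMPSON",
  "I mean, an we be in choler, we'll draw.",
  "",
  "GREGORY",
  "Ay, while you live,",
  "draw your neck out o' the collar.",
].join("\n");

// Because the regex has a capture group, split() keeps each name:
// ["", name1, speech1, name2, speech2, ...]
const parts = script.split(speakerLine);

const speeches = [];
for (let i = 1; i < parts.length; i += 2) {
  // trim() removes the line breaks left around each speech
  speeches.push({ name: parts[i], speech: parts[i + 1].trim() });
}

console.log(speeches);
```

Since the split happens on whole name lines, a speech can contain any number of line breaks without changing the regex.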

Upvotes: 1

Nathan

Reputation: 7032

Okay, I did a little tinkering and found something that works. It isn't super elegant, but it does the job.

([A-Z0-9\s]+)\s*\:?\s*[\r\n]((.+[\r\n]?.*)+)[\r\n]{2}

I modified the last capture group to allow endless repetitions of arbitrary text, a line break, and more arbitrary text. Since the group can't match two line breaks in a row, the match ends at the blank line after the speech.
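
For comparison, here is a hedged sketch of another way to stop at the next name: a lazy quantifier followed by a lookahead. The lookahead only checks for the next name line and does not consume it, so the name is still there for the next match. The shape of a name line, the "\n" line endings and the sample text are all assumptions, and this has not been run against a full script.

```javascript
// Assumptions: a name is a whole line of capitals, digits and spaces with
// an optional colon, and lines end in "\n".
// ([\s\S]+?) lazily takes the speech, line breaks included, and stops as
// soon as the lookahead sees either the next name line or the end of input.
const speechPattern =
  /^([A-Z][A-Z0-9 ]*):?\n([\s\S]+?)(?=\n+[A-Z][A-Z0-9 ]*:?\n|\n*(?![\s\S]))/gm;

const script = [
  "SAMPSON",
  "I mean, an we be in choler, we'll draw.",
  "",
  "GREGORY:",
  "Ay, while you live,",
  "draw your neck out o' the collar.",
  "",
].join("\n");

const speeches = [];
let match;
// With the g flag, each exec() call resumes where the previous match ended
while ((match = speechPattern.exec(script)) !== null) {
  speeches.push({ name: match[1], speech: match[2] });
}

console.log(speeches);
```

Unlike the pattern above, this version does not need a blank line after the last speech.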

Upvotes: 0
