Reputation: 11321
A bunch of us English grad students are studying dialog in Virginia Woolf's novel The Waves, and I've been trying to mark up the novel in TEI. To do this, it would be useful to write a regex that captures the dialog. Thankfully, The Waves is extremely regular, and almost all the dialog is in the form:
'Now they have all gone,' said Louis. 'I am alone. They have gone into the house for breakfast,'
A speech can, however, continue for several paragraphs. I'm trying to write a regex to match all the paragraphs of a given speaker.
This is discussed briefly in Chris Foster's blog post, where he suggests something like /'([\^,]+,)' said Louis, '(*)'/, although I think this would only match single paragraphs. This is how I'm thinking through it:
I could probably do this with a ton of awkward Python, but I'd love to know whether this is possible with regex.
Upvotes: 1
Views: 115
Reputation: 30283
It seems, from your link, that the text follows these rules: lines end with \n, paragraphs are separated by \n\n+, and ' is used to demarcate speech.
Here's a quick attempt (scroll all the way down to view the match groups). It is flawed, I'm sure, but there's enough here to lead you in the right direction. Note how, if you concatenate the three capture groups, idiomatically known as $1, $2, and $3, you get each character's speech, including the punctuation captured at the "said" separator. However, certain quirks of the language throw this regular expression off. For example, quotes are not closed at the end of a paragraph, yet new quotes open if the speech continues into the next paragraph, which defeats the whole balanced-quotes strategy. Apostrophes cause the same problem.
\n\n.*?'([^^]+?[?]?),?' said (?:[A-Z][a-z]+)(?:([.])  |, )'([^^]+?)'(?=[^']*(?:'[^']')*[^']*\n\n.*'(?:[^^]+?[?]?),?' said (?:[A-Z][a-z]+)(?:[.]  |, ))
Reading the pattern from left to right:
\n\n : new paragraph
.*? : match as many non-newline characters as needed (but no more than needed)
' : match a (start-)quote
([^^]+? : match any character as needed (but no more than needed)
[?]?) : match a question mark, optionally
,? : match a comma, optionally
' : match an (end-)quote
 said (?:[A-Z][a-z]+) : match the "said Bernard"
(?:([.])  |, ) : match either a period followed by two spaces, or a comma followed by one space
' : match a (start-)quote
([^^]+?) : match any character as needed (but no more than needed)
' : match an (end-)quote
(?=...) : assert that this end-quote is followed by a string of non-quote characters, then zero or more strings of quoted non-quote characters, then another string of non-quote characters, a new paragraph, and the next "said Bernard"; otherwise fail.
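To try the pattern outside Rubular, here is a minimal Python sketch using the standard re module. It assumes the period alternative is followed by two spaces, as the explanation above states. SAMPLE is an abridged stand-in for the novel's text, and join_groups is a hypothetical helper, not part of the original answer.

```python
import re

# The pattern from the answer, split over two literals for readability.
# Assumption: a period separator is followed by TWO spaces.
SPEECH = re.compile(
    r"\n\n.*?'([^^]+?[?]?),?' said (?:[A-Z][a-z]+)(?:([.])  |, )'([^^]+?)'"
    r"(?=[^']*(?:'[^']')*[^']*\n\n.*'(?:[^^]+?[?]?),?' said (?:[A-Z][a-z]+)(?:[.]  |, ))"
)


def join_groups(match):
    """Concatenate $1, $2 and $3 into one readable speech (hypothetical helper)."""
    first, period, rest = match.groups()
    # $2 is None when the separator was a comma; the pattern consumes that
    # comma outside the groups, so it is put back here.
    return f"{first}{period or ','} {rest}"


# Abridged stand-in text; every paragraph is preceded by a blank line.
SAMPLE = (
    "\n\n'I see a ring,' said Bernard, 'hanging above me.'"
    "\n\n'I see the beetle,' said Susan.  'It is black, I see.'"
    "\n\n'I hear a sound,' said Rhoda, 'cheep, chirp.'\n"
)

speeches = [join_groups(m) for m in SPEECH.finditer(SAMPLE)]
for speech in speeches:
    print(speech)
```

On this sample only the first two speeches are returned. The last paragraph is skipped because the lookahead requires a following paragraph that contains a "said" clause, so the final speech of any text would need separate handling.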
Rubular matches (an excerpt):
Match 3
1. But when we sit together, close
2.
3. we melt into each
other with phrases. We are edged with mist. We make an
unsubstantial territory.
Match 4
1. I see the beetle
2. .
3. It is black, I see; it is green,
I see; I am tied down with single words. But you wander off; you
slip away; you rise up higher, with words and words in phrases.
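Since unclosed quotes on continued paragraphs defeat a single balanced-quotes pattern, another option is to split the text into paragraphs first and carry the speaker forward. The sketch below is an alternative to the regex above, not the method used in this answer. It assumes every new speech names its speaker with "' said Name" and that any paragraph without such a clause continues the previous speaker. The names SPEAKER and paragraphs_by_speaker are hypothetical, and SAMPLE is made-up stand-in text in the novel's style.

```python
import re
from collections import defaultdict

# Assumption: a paragraph that opens a new speech contains "' said Name"
# followed by a period or a comma.
SPEAKER = re.compile(r"' said ([A-Z][a-z]+)[.,]")


def paragraphs_by_speaker(text):
    """Group paragraphs under the most recently named speaker."""
    speeches = defaultdict(list)
    current = None
    for para in re.split(r"\n\n+", text.strip()):
        found = SPEAKER.search(para)
        if found:
            current = found.group(1)
        if current is not None:
            # Paragraphs before the first named speaker are ignored.
            speeches[current].append(para)
    return dict(speeches)


# Made-up stand-in text: Louis speaks for two paragraphs, then Bernard.
SAMPLE = (
    "'Now they have all gone,' said Louis.  'I am alone.\n\n"
    "'I am still alone.'\n\n"
    "'I see a ring,' said Bernard, 'hanging above me.'\n"
)

result = paragraphs_by_speaker(SAMPLE)
for speaker, paras in result.items():
    print(speaker, len(paras))
```

The paragraphs come back with their quotes and "said" clauses intact, so they would still need cleaning before being wrapped in TEI markup.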
Upvotes: 1