Yu Chen

Reputation: 7490

Using regex to capture dialogue of a character in Shakespeare

I'm trying to use regex to capture Shakespeare dialogue as practice in text matching. For instance, I want to capture all the text spoken by the character CALIBAN in this scene:

  PROSPERO. Thou most lying slave,
    Whom stripes may move, not kindness! I have us'd thee,
    Filth as thou art, with human care, and lodg'd thee
    In mine own cell, till thou didst seek to violate
    The honour of my child.

  CALIBAN. O ho, O ho! Would't had been done.
    Thou didst prevent me. I had peopl'd else
    This isle with Calibans.

  PROSPERO. Thou most lying slave,
    Whom stripes may move, not kindness! I have us'd thee,
    Filth as thou art, with human care, and lodg'd thee
    In mine own cell, till thou didst seek to violate
    The honour of my child.

  CALIBAN. O ho, O ho! Would't had been done.
    Thou didst prevent me. I had peopl'd else
    This isle with Calibans.

I'd like to capture

O ho, O ho! Would't had been done.
        Thou didst prevent me. I had peopl'd else
        This isle with Calibans.

How would I use regex to accomplish this? I tried this particular regex:

(?<=\n  CALIBAN\. )[A-Za-z ',\.\n\!-]+(?=\n  PROSPERO\. |$)

Note: in the actual text, there are always two whitespace characters before each character's name, and each line has a carriage return at the end of it. My regex looks for CALIBAN. to start, then matches some text, and requires that the match be followed by PROSPERO. However, when I plug this into regexp.com, my entire text is matched.

Upvotes: 0

Views: 148

Answers (2)

anubhava

Reputation: 785521

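You may use this regex with a lazy quantifier. With a greedy +, the character class also matches the later PROSPERO. lines, so the match runs on until $; the lazy +? stops at the first position where the lookahead succeeds: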

(?<=\n  CALIBAN\. )[A-Za-z\s',.!-]+?(?=\n  PROSPERO\. |$)


In PHP use:

$re = '/(?<=\n  CALIBAN\. )[A-Za-z\s\',.!-]+?(?=\n  PROSPERO\. |$)/';

// $str holds the full text of the scene
preg_match_all($re, $str, $matches);

// With the default PREG_PATTERN_ORDER, $matches[0] lists every full match
print_r($matches[0]);
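
To see what the lazy quantifier changes, here is a rough Python sketch that runs the question's greedy pattern and the lazy one above against the sample scene. It uses Python's re module rather than PHP purely for illustration, and the variable names (prospero, caliban, text, greedy, lazy) are my own, not anything from the original post.

```python
import re

# Sample text rebuilt from the question: two spaces before each speaker name,
# four before continuation lines, and a blank line between speeches.
prospero = (
    "  PROSPERO. Thou most lying slave,\n"
    "    Whom stripes may move, not kindness! I have us'd thee,\n"
    "    Filth as thou art, with human care, and lodg'd thee\n"
    "    In mine own cell, till thou didst seek to violate\n"
    "    The honour of my child.\n"
)
caliban = (
    "  CALIBAN. O ho, O ho! Would't had been done.\n"
    "    Thou didst prevent me. I had peopl'd else\n"
    "    This isle with Calibans.\n"
)
text = "\n".join([prospero, caliban, prospero, caliban])

# The question's pattern: the greedy + keeps consuming the later speeches.
greedy = r"(?<=\n  CALIBAN\. )[A-Za-z ',\.\n\!-]+(?=\n  PROSPERO\. |$)"
# The pattern from this answer: +? stops as soon as the lookahead succeeds.
lazy = r"(?<=\n  CALIBAN\. )[A-Za-z\s',.!-]+?(?=\n  PROSPERO\. |$)"

greedy_matches = re.findall(greedy, text)
lazy_matches = re.findall(lazy, text)

print(len(greedy_matches), "greedy match(es)")
for speech in lazy_matches:
    # strip() drops the newline left over from the blank line after a speech
    print(repr(speech.strip()))
```

The greedy pattern returns a single match that swallows the second PROSPERO speech, while the lazy one returns one match per CALIBAN speech. Continuation lines keep their four-space indentation, so strip or re-indent them afterwards if that matters.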

Upvotes: 2

Christopher Hart

Reputation: 60

Try using the following regex:

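CALIBAN\. ((.*\n .*)*)

The first capture group (group 1) will match the text spoken by Caliban without including his name. It relies on continuation lines starting with a space and on a blank line following the speech, as in the provided example, so it should work for that text.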
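
As a quick check of the capture-group approach, here is a small Python sketch. The helper name speeches and the two test layouts are hypothetical, and the PROSPERO speech is shortened to its first two lines. It also shows the limitation: the pattern depends on a blank line ending the speech.

```python
import re

caliban = (
    "  CALIBAN. O ho, O ho! Would't had been done.\n"
    "    Thou didst prevent me. I had peopl'd else\n"
    "    This isle with Calibans.\n"
)
# First two lines of PROSPERO's speech from the question (shortened).
prospero = (
    "  PROSPERO. Thou most lying slave,\n"
    "    Whom stripes may move, not kindness! I have us'd thee,\n"
)

pattern = re.compile(r"CALIBAN\. ((.*\n .*)*)")

def speeches(text):
    """Return group 1 of every match, i.e. the text after the speaker name."""
    return [m.group(1) for m in pattern.finditer(text)]

with_blank = caliban + "\n" + prospero   # blank line between speeches, as in the question
without_blank = caliban + prospero       # hypothetical layout with no blank line

print(speeches(with_blank))     # only Caliban's lines
print(speeches(without_blank))  # runs on into PROSPERO's lines
```

Because each repetition requires a newline followed by a space, the group ends at the first blank line. Without that blank line the next speaker's indented line is captured too, so this is best used only when the layout matches the question's text.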

Upvotes: 1
