user30938

Reputation: 31

How to break apart a play script with the form **Speaker: Dialogue** to get all of a character's dialogue into a single text block?

So far, I have imported the text:

tempest.v <- scan("data/plainText/tempest.txt", what="character", sep="\n")

Identified where all of the speaker positions begin:

speaker.positions.v <- grep('^[^\\s]\\w+:', tempest.v)

Added a marker at the end of the text:

tempest.v <- c(tempest.v, "END:")

Here's the part where I'm having difficulty (assuming what I've already done is useful):

    for (i in seq_along(speaker.positions.v)) {
        if (i != length(speaker.positions.v)) {
            # Line holding "Speaker: first line of dialogue"
            speaker.name <- tempest.v[speaker.positions.v[i]]
            speaker.name <- strsplit(speaker.name, ":")
            speaker.name <- unlist(speaker.name)
            # Lines between this speaker's line and the next speaker's line
            start <- speaker.positions.v[i] + 1
            end <- speaker.positions.v[i + 1] - 1
            speaker.lines.v <- tempest.v[start:end]
        }
    }

Now I have a variable, speaker.name, that holds the name of the character who is speaking on the left-hand side of the split. The right-hand side of the split is the dialogue only up to the first line break.

I set the start of the dialogue block at position [i]+1 and the end at [i+1]-1 (i.e., one position back from the beginning of the subsequent speaker's name).

Now I have a variable, speaker.lines.v, with all of the lines of dialogue for that speaker for that one speech.

How can I collect all of Prospero's then Miranda's (then any other character's) dialogue into a single (list? vector? data frame?) for analysis?

Any help with this would be greatly appreciated.

Happy New Year!

--- TEXT ---

Miranda: If by your art, my dearest father, you have Put the wild waters in this roar, allay them. The sky, it seems, would pour down stinking pitch, But that the sea, mounting to the welkin's cheek, Dashes the fire out. O, I have suffered With those that I saw suffer -- a brave vessel,

Who had, no doubt, some noble creature in her, Dash'd all to pieces. O, the cry did knock Against my very heart. Poor souls, they perish'd. Had I been any god of power, I would Have sunk the sea within the earth or ere It should the good ship so have swallow'd and The fraughting souls within her.

Prospero: Be collected: No more amazement: tell your piteous heart There's no harm done.

Miranda: O, woe the day!

Prospero: No harm. I have done nothing but in care of thee, Of thee, my dear one, thee, my daughter, who Art ignorant of what thou art, nought knowing Of whence I am, nor that I am more better Than Prospero, master of a full poor cell, And thy no greater father.

Miranda: More to know Did never meddle with my thoughts.

Prospero: 'Tis time I should inform thee farther. Lend thy hand, And pluck my magic garment from me. So:

[Lays down his mantle]

Lie there, my art. Wipe thou thine eyes; have comfort. The direful spectacle of the wreck, which touch'd The very virtue of compassion in thee, I have with such provision in mine art So safely ordered that there is no soul— No, not so much perdition as an hair Betid to any creature in the vessel Which thou heard'st cry, which thou saw'st sink. Sit down; For thou must now know farther.

--- END TEXT ---

Upvotes: 2

Views: 188

Answers (3)

Tyler Rinker

Reputation: 109894

I was interested in this question because I'm developing a series of tools for these types of tasks. Here is how to solve this problem using those tools.

if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/textshape", "trinker/qdapRegex")
pacman::p_load(dplyr)

pat <- '^[^\\s]\\w+:'

"tempest.txt" %>%
    readLines() %>%
    {.[!grepl("^(---)|(^\\s*$)", .)]} %>%
    split_match(pat, regex=TRUE, include=TRUE) %>%
    textshape::combine() %>%
    {setNames(., sapply(., function(x) unlist(ex_default(x, pattern = pat))))} %>%
    bind_list("person") %>%
    mutate(content = gsub(pat, "", content)) %>%
    `[` %>%
    textshape::combine()

Result:

     person                                     content
1  Miranda:  If by your art, my dearest father, you ...
2 Prospero:  Be collected No more amazement tell you ..

To avoid combining each speaker's lines (as @RichieCotton shows initially), leave off the last textshape::combine() in the chain.

Upvotes: 2

Richie Cotton

Reputation: 121087

We're going to use the rebus package to create regular expressions, stringi to match those regular expressions, and data.table to store the data.

library(rebus)
library(stringi)
library(data.table)

First, trim leading and trailing spaces from the lines:

tempest.v <- stri_trim(tempest.v)

Get rid of empty lines:

tempest.v <- tempest.v[nzchar(tempest.v)]

Remove stage directions:

stage_dir_rx <- exactly(
  OPEN_BRACKET %R%
  one_or_more(printable()) %R%
  "]" 
)
is_stage_dir_line <- stri_detect_regex(tempest.v, stage_dir_rx)
tempest.v <- tempest.v[!is_stage_dir_line]
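
If you would rather not build the pattern with rebus, here is a minimal base-R sketch of the same filter. It assumes a stage direction occupies a whole line and is wrapped in square brackets, as in the sample text; the `lines` vector is a made-up miniature, not the real file.

```r
# Hypothetical miniature of the script, one element per line.
lines <- c("Prospero: So:", "[Lays down his mantle]", "Lie there, my art.")

# TRUE for lines that consist entirely of "[...]".
is_stage_dir <- grepl("^\\[.+\\]$", lines)

# Keep everything that is not a stage direction.
spoken <- lines[!is_stage_dir]
spoken
```

Stage directions embedded in the middle of a line of dialogue would not be caught by this pattern.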

Match lines containing "character: dialogue".

character_dialogue_rx <- START %R%
  optional(capture(one_or_more(alpha()) %R% lookahead(":"))) %R%
  optional(":") %R%
  zero_or_more(space()) %R%
  capture(one_or_more(printable()))
matches <- stri_match_first_regex(tempest.v, character_dialogue_rx)

Store the matches in a data.table (we need this for the roll functionality). A line number key column is also needed in a moment.

tempest_data <- data.table(
  line_number = seq_len(nrow(matches)),
  character = matches[, 2],
  dialogue = matches[, 3]
)

Fill in missing values, using the method described in this answer.

setkey(tempest_data, line_number)
tempest_data[, character := tempest_data[!is.na(character)][tempest_data, character, roll = TRUE]]
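
The rolling join carries the last known speaker forward onto continuation lines. As a rough illustration of that idea only (not a replacement for the data.table code), the same fill-forward can be sketched in base R with cumsum(). The `speaker` vector is a made-up example, and the sketch assumes its first element is not NA.

```r
# Hypothetical speaker column: NA marks a continuation line.
speaker <- c("Miranda", NA, "Prospero", "Miranda", NA, NA)

# For each row, count how many non-missing speakers have been seen so far.
# That count is the index of the most recent known speaker.
last_seen <- cumsum(!is.na(speaker))

# Look up the most recent known speaker for every row.
filled <- speaker[!is.na(speaker)][last_seen]
filled
```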

The data currently has line information preserved: each row contains one line of dialogue.

   line_number character                  dialogue
1:           1   Miranda If by your art, my de....
2:           2   Miranda Who had, no doubt, so....
3:           3  Prospero Be collected: No more....
4:           4   Miranda           O, woe the day!
5:           5  Prospero No harm. I have done ....
6:           6   Miranda More to know Did neve....
7:           7  Prospero 'Tis time I should in....
8:           8  Prospero Lie there, my art. Wi....

To get all the dialogue for a given character as a single string, summarise using the by argument.

tempest_data[, .(all_dialogue = paste(dialogue, collapse = "\n")), by = "character"]
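
For comparison, a base-R sketch of the same grouping step, assuming you already have one speaker label per line of dialogue. The two vectors below are shortened, made-up stand-ins for the `character` and `dialogue` columns.

```r
# Hypothetical parallel vectors: one speaker label per line of dialogue.
speaker  <- c("Miranda", "Miranda", "Prospero", "Miranda")
dialogue <- c("If by your art,", "Who had, no doubt,",
              "Be collected:", "O, woe the day!")

# Paste together every line belonging to the same speaker.
by_speaker <- tapply(dialogue, speaker, paste, collapse = "\n")

# Look up one character's complete text block by name.
by_speaker[["Miranda"]]
```

tapply() returns the groups in alphabetical order of speaker name, not in order of first appearance.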

Upvotes: 2

MichaelChirico

Reputation: 34733

I first saved the text you put here as test.txt. Then read it:

tempest <- scan("~/Desktop/test.txt", what = "character", sep = "\n")

Then I pulled only the lines that start with a speaker's name, as you did:

speakers <- tempest[grepl("^[^\\s]\\w+:", tempest)]

Then we split off the speaker's name:

speaker_split <- strsplit(speakers, split = ":")

And get the names:

speaker_names <- sapply(speaker_split, "[", 1L)

And what they said (re-joining the pieces with ":" because strsplit also split on any other colons inside their lines):

speaker_parts <- sapply(speaker_split, function(x) paste(x[-1L], collapse = ":"))
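
An alternative that avoids the split-then-rejoin step is to cut each line at the first colon only. This is just a sketch with sub(), using a made-up example line; it assumes the speaker's name itself never contains a colon.

```r
# Hypothetical line with a second colon inside the dialogue.
line <- "Prospero: Be collected: No more amazement"

# Everything before the first colon is the speaker's name.
name <- sub(":.*$", "", line)

# Drop the name and the first colon, then trim the leading space.
speech <- trimws(sub("^[^:]*:", "", line))

name
speech
```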

From here we just need indices of who said what and we can do what we want:

prosp <- which(speaker_names == "Prospero")
miran <- which(speaker_names == "Miranda")

And play to your heart's content.

Who said the most words?

> sum(unlist(strsplit(speaker_parts[prosp], split = "")) == " ")
[1] 82
> sum(unlist(strsplit(speaker_parts[miran], split = "")) == " ")
[1] 67

Prospero.
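
Counting spaces is only a proxy for counting words. If you want the words themselves, one possible sketch is to split on runs of whitespace and count the pieces. The `parts` vector here is a made-up stand-in for something like `speaker_parts[miran]`.

```r
# Hypothetical speeches for one character.
parts <- c("O, woe the day!",
           "More to know Did never meddle with my thoughts.")

# Split each speech on whitespace and count the resulting tokens.
word_counts <- lengths(strsplit(trimws(parts), "\\s+"))

# Total words across all of the character's speeches.
total_words <- sum(word_counts)
total_words
```

Punctuation stays attached to the tokens here, which does not affect the count but would matter if you went on to tabulate the words.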

What is the frequency of letters used by Miranda?

> table(tolower(unlist(strsplit(gsub("[^A-Za-z]", "", speaker_parts[miran]),
                       split = ""))))

 a  b  c  d  e  f  g  h  i  k  l  m  n  o  p  r  s  t  u  v  w  y 
17  3  2 11 34  7  3 21 16  5  7  7  9 17  3 14 18 30 11  5 10  8 

Upvotes: 2
