Reputation: 31
So far, I have imported the text:
tempest.v <- scan("data/plainText/tempest.txt", what="character", sep="\n")
Identified where all of the speaker positions begin:
speaker.positions.v <- grep('^[^\\s]\\w+:', tempest.v)
Added a marker at the end of the text:
tempest.v <- c(tempest.v, "END:")
Here's the part where I'm having difficulty (assuming what I've already done is useful):
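for(i in 1:length(speaker.positions.v)){
  # Skip the last speaker: there is no following position to mark where it ends
  if(i != length(speaker.positions.v)){
    # Split "Name: first line of dialogue" at the colon
    speaker.name <- tempest.v[speaker.positions.v[i]]
    speaker.name <- strsplit(speaker.name, ":")
    speaker.name <- unlist(speaker.name)
    # Lines between this speaker's line and the next speaker's line
    start <- speaker.positions.v[i]+1
    end <- speaker.positions.v[i+1]-1
    speaker.lines.v <- tempest.v[start:end]
  }
}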
Now I have a variable, speaker.name, that holds the name of the speaking character on the left-hand side of the split. The right-hand side of the split is the dialogue, but only up to the first line break.
I set the start of the dialogue block at position [i]+1 and the end at [i+1]-1 (i.e., one position back from the beginning of the subsequent speaker's name).
Now I have a variable, speaker.lines.v with all of the lines of dialogue for that speaker for that one speech.
How can I collect all of Prospero's then Miranda's (then any other character's) dialogue into a single (list? vector? data frame?) for analysis?
Any help with this would be greatly appreciated.
Happy New Year!
--- TEXT ---
Miranda: If by your art, my dearest father, you have Put the wild waters in this roar, allay them. The sky, it seems, would pour down stinking pitch, But that the sea, mounting to the welkin's cheek, Dashes the fire out. O, I have suffered With those that I saw suffer -- a brave vessel,
Who had, no doubt, some noble creature in her, Dash'd all to pieces. O, the cry did knock Against my very heart. Poor souls, they perish'd. Had I been any god of power, I would Have sunk the sea within the earth or ere It should the good ship so have swallow'd and The fraughting souls within her.
Prospero: Be collected: No more amazement: tell your piteous heart There's no harm done.
Miranda: O, woe the day!
Prospero: No harm. I have done nothing but in care of thee, Of thee, my dear one, thee, my daughter, who Art ignorant of what thou art, nought knowing Of whence I am, nor that I am more better Than Prospero, master of a full poor cell, And thy no greater father.
Miranda: More to know Did never meddle with my thoughts.
Prospero: 'Tis time I should inform thee farther. Lend thy hand, And pluck my magic garment from me. So:
[Lays down his mantle]
Lie there, my art. Wipe thou thine eyes; have comfort. The direful spectacle of the wreck, which touch'd The very virtue of compassion in thee, I have with such provision in mine art So safely ordered that there is no soul— No, not so much perdition as an hair Betid to any creature in the vessel Which thou heard'st cry, which thou saw'st sink. Sit down; For thou must now know farther.
--- END TEXT ---
Upvotes: 2
Views: 188
Reputation: 109894
I was interested in this question because I'm developing a series of tools for these types of tasks. Here is how to solve this problem using those tools.
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/textshape", "trinker/qdapRegex")
pacman::p_load(dplyr)
pat <- '^[^\\s]\\w+:'
"tempest.txt" %>%
readLines() %>%
{.[!grepl("^(---)|(^\\s*$)", .)]} %>%
split_match(pat, regex=TRUE, include=TRUE) %>%
textshape::combine() %>%
{setNames(., sapply(., function(x) unlist(ex_default(x, pattern = pat))))} %>%
bind_list("person") %>%
mutate(content = gsub(pat, "", content)) %>%
`[` %>%
textshape::combine()
result
person content
1 Miranda: If by your art, my dearest father, you ...
2 Prospero: Be collected No more amazement tell you ..
To avoid combining (as @RichieCotton displays initially), leave off the last textshape::combine() in the chain.
Upvotes: 2
Reputation: 121087
We're going to use the rebus package to create regular expressions, stringi to match those regular expressions, and data.table to store the data.
library(rebus)
library(stringi)
library(data.table)
First, trim leading and trailing spaces from the lines:
tempest.v <- stri_trim(tempest.v)
Get rid of empty lines
tempest.v <- tempest.v[nzchar(tempest.v)]
Remove stage directions
stage_dir_rx <- exactly(
  OPEN_BRACKET %R%
    one_or_more(printable()) %R%
    "]"
)
is_stage_dir_line <- stri_detect_regex(tempest.v, stage_dir_rx)
tempest.v <- tempest.v[!is_stage_dir_line]
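For readers without rebus installed, the same filtering step can be sketched in base R. This is only a rough equivalent of the pattern above, assuming a stage direction is a line consisting entirely of bracketed printable text; the play_lines vector is a made-up sample, not the full play.

```r
# Made-up sample lines; a stage direction sits on its own line in brackets.
play_lines <- c(
  "And pluck my magic garment from me. So:",
  "[Lays down his mantle]",
  "Lie there, my art."
)

# TRUE where the whole line is "[" + printable characters + "]".
is_stage_dir <- grepl("^\\[[[:print:]]+\\]$", play_lines)

# Keep only the spoken lines.
spoken <- play_lines[!is_stage_dir]
```

A bracketed aside embedded in the middle of a spoken line would not be removed by this pattern, since it is anchored at both ends.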
Match lines containing "character: dialogue".
character_dialogue_rx <- START %R%
  optional(capture(one_or_more(alpha()) %R% lookahead(":"))) %R%
  optional(":") %R%
  zero_or_more(space()) %R%
  capture(one_or_more(printable()))
matches <- stri_match_first_regex(tempest.v, character_dialogue_rx)
Store the matches in a data.table (we need this for the roll functionality). A line number key column is also needed in a moment.
tempest_data <- data.table(
line_number = seq_len(nrow(matches)),
character = matches[, 2],
dialogue = matches[, 3]
)
Fill in missing values, using the method described in this answer.
setkey(tempest_data, line_number)
tempest_data[, character := tempest_data[!is.na(character)][tempest_data, character, roll = TRUE]]
The data currently has line information preserved: each row contains one line of dialogue.
line_number character dialogue
1: 1 Miranda If by your art, my de....
2: 2 Miranda Who had, no doubt, so....
3: 3 Prospero Be collected: No more....
4: 4 Miranda O, woe the day!
5: 5 Prospero No harm. I have done ....
6: 6 Miranda More to know Did neve....
7: 7 Prospero 'Tis time I should in....
8: 8 Prospero Lie there, my art. Wi....
To get all the dialogue for a given character as a single string, summarise using the by argument.
tempest_data[, .(all_dialogue = paste(dialogue, collapse = "\n")), by = "character"]
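The fill-down-then-collapse idea does not depend on data.table. Below is a minimal base R sketch of the same two steps, under the assumption that the first line always names a speaker; the play_lines vector and the variable names are illustrative only.

```r
# Made-up sample: one element per line, already trimmed.
play_lines <- c(
  "Miranda: If by your art, my dearest father,",
  "Who had, no doubt, some noble creature in her,",
  "Prospero: Be collected:",
  "Miranda: O, woe the day!"
)

# Lines that open a new speech start with "Name:".
has_speaker <- grepl("^[A-Za-z]+:", play_lines)

# cumsum() numbers each speech; indexing the speaker names by that
# number carries the last seen speaker forward onto continuation lines.
# (Assumes the first line has a speaker, so no speech number is 0.)
speech_id <- cumsum(has_speaker)
speaker <- sub(":.*$", "", play_lines[has_speaker])[speech_id]

# Strip the "Name: " prefix so only the dialogue remains.
dialogue <- sub("^[A-Za-z]+:\\s*", "", play_lines)

# One string per character, with lines joined by newlines.
all_dialogue <- tapply(dialogue, speaker, paste, collapse = "\n")
```

Here all_dialogue is a named array, so all_dialogue[["Miranda"]] returns everything Miranda says in the sample as a single string.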
Upvotes: 2
Reputation: 34733
I first saved the text you put here as test.txt. Then read it:
tempest <- scan("~/Desktop/test.txt", what = "character", sep = "\n")
Then I pulled only the spoken lines, as you did:
speakers <- tempest[grepl("^[^\\s]\\w+:", tempest)]
Then we split off the speaker's name:
speaker_split <- strsplit(speakers, split = ":")
And get the names:
speaker_names <- sapply(speaker_split, "[", 1L)
And what they said (collapsing because their lines may have had other colons that we lost):
speaker_parts <- sapply(speaker_split, function(x) paste(x[-1L], collapse = ":"))
From here we just need indices of who said what and we can do what we want:
prosp <- which(speaker_names == "Prospero")
miran <- which(speaker_names == "Miranda")
And play to your heart's content.
Who said the most words? (Counting spaces as a rough proxy for words.)
> sum(unlist(strsplit(speaker_parts[prosp], split = "")) == " ")
[1] 82
> sum(unlist(strsplit(speaker_parts[miran], split = "")) == " ")
[1] 67
Prospero.
What is the frequency of letters used by Miranda?
> table(tolower(unlist(strsplit(gsub("[^A-Za-z]", "", speaker_parts[miran]),
split = ""))))
a b c d e f g h i k l m n o p r s t u v w y
17 3 2 11 34 7 3 21 16 5 7 7 9 17 3 14 18 30 11 5 10 8
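The per-speaker indices (prosp, miran) can be generalised to every character at once with split(). The sketch below is one possible way to do that; the short speaker_names and speaker_parts vectors are hypothetical stand-ins for the ones built earlier, and count_words is a helper introduced here for illustration.

```r
# Hypothetical stand-ins for speaker_names and speaker_parts.
speaker_names <- c("Miranda", "Prospero", "Miranda")
speaker_parts <- c(
  " If by your art",
  " Be collected: No more amazement",
  " O, woe the day!"
)

# Group speeches by speaker: a named list, one character vector per name.
by_speaker <- split(speaker_parts, speaker_names)

# Count words by splitting trimmed text on runs of whitespace,
# rather than counting individual space characters.
count_words <- function(x) sum(lengths(strsplit(trimws(x), "\\s+")))
word_counts <- sapply(by_speaker, count_words)
```

With a list like by_speaker, any further analysis can be applied to all characters with sapply() or lapply() instead of creating one index variable per name.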
Upvotes: 2