Reputation: 35
I'm looking to split some television scripts into a data frame with two variables: (1) spoken dialogue and (2) speaker.
Here is the sample data: http://www.buffyworld.com/buffy/transcripts/127_tran.html
Loaded to R via:
library(rvest)
url <- 'http://www.buffyworld.com/buffy/transcripts/127_tran.html'
# Parse the page, then flatten it to a single character string
page <- read_html(url)
all <- page %>% html_text()
all
[1] "Selfless - Buffy Episode 7x5 'Selfless' (#127) Transcript\n\nBuffy Episode #127: \"Selfless\" \n Transcript\nWritten by Drew Goddard\n Original Air Date: October 22, 2002 Skip Teaser.. Take Me To Beginning Of Episode. \n\n \n \n NB: The content of this transcript, including the characters \n and the story, belongs to Mutant Enemy. This transcript was created \n based on the broadcast episode.\n \n \n \n \n BUFFYWORLD.COM \n prefers that you direct link to this transcript rather than post \n it on your site, but you can post it on your site if you really \n want, as long as you keep everything intact, this includes the link \n to buffyworld.com and this writing. Please also keep the disclaimers \n intact.\n \n Originally transcribed for: http://www.buffyworld.com/.\n\t \n TEASER (RECAP SEGMENT):\n GILES (V.O.)\n\n Previousl... <truncated>
What I'm trying now is to split at each character's name (I have a full list), for example 'GILES' above. This works fine, except that I can't retain the character name if I split there. Here's a simplified example:
to_parse <- paste(c('BUFFY', 'WILLOW'), collapse = '|')
all <- strsplit(all, to_parse)
This gives me the splits I want, but doesn't retain the character name.
Finite question: Is there any way to retain the character name with what I'm doing? Infinite question: Are there any other approaches I should be trying?
Thanks in advance!
Upvotes: 3
Views: 178
Reputation: 14360
I think you can use Perl-compatible regular expressions with strsplit. For explanatory purposes, I used a shorter sample string, but it should work the same:
string <- "text BUFFY more text WILLOW other text"
to_parse <- paste(c('BUFFY', 'WILLOW'), collapse = '|')
strsplit(string, paste0("(?<=", to_parse, ")"), perl = TRUE)
#[[1]]
#[1] "text BUFFY" " more text WILLOW" " other text"
As suggested by @Lamia, if the name comes before the text (as it does in the transcript), you can use a positive look-ahead instead. I edited the suggestion slightly so that each piece of the split keeps its delimiter:
strsplit(string, paste0("(?<=.(?=", to_parse, "))"), perl = TRUE)
#[[1]]
#[1] "text " "BUFFY more text " "WILLOW other text"
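Since the end goal is a data frame with a speaker column and a dialogue column, here is a rough sketch of how the look-ahead split could be taken one step further. It reuses the short sample string rather than the real transcript, and the names speakers, name_pattern, pieces and script are just placeholders I made up for illustration:

```r
# Short sample string standing in for the real transcript text
string <- "text BUFFY more text WILLOW other text"
speakers <- c("BUFFY", "WILLOW")   # assumed: your full list of names
to_parse <- paste(speakers, collapse = "|")

# Split just before each name, so every piece starts with its speaker
pieces <- strsplit(string, paste0("(?<=.(?=", to_parse, "))"), perl = TRUE)[[1]]

# Drop any leading piece that does not start with a known name
name_pattern <- paste0("^(", to_parse, ")")
pieces <- pieces[grepl(name_pattern, pieces)]

# Peel the name off the front and trim whitespace from what is left
speaker <- regmatches(pieces, regexpr(name_pattern, pieces))
dialogue <- trimws(sub(name_pattern, "", pieces))

script <- data.frame(speaker = speaker, dialogue = dialogue,
                     stringsAsFactors = FALSE)
script
```

One caveat: this splits wherever a name appears, including inside dialogue or in a string such as BUFFYWORLD.COM in the page header, so on the real transcript you would probably need a stricter pattern (for example, requiring the name to sit at the start of a line). I have not tested that against the full page.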
Upvotes: 3