Split long string by a vector of words

Question

I'm looking to split some television scripts into a data frame with two variables: (1) spoken dialogue and (2) speaker.

Here is the sample data: http://www.buffyworld.com/buffy/transcripts/127_tran.html

Loaded to R via:

library(rvest)

# Fetch and parse the transcript page
url <- 'http://www.buffyworld.com/buffy/transcripts/127_tran.html'
page <- read_html(url)

# Extract the page's text as a single string
all <- page %>% html_text()

[1] "Selfless - Buffy Episode 7x5 'Selfless' (#127) Transcript

Buffy Episode #127: "Selfless" 
  Transcript
Written by Drew Goddard
  Original Air Date: October 22, 2002 Skip Teaser.. Take Me To Beginning Of Episode. 

 
   
        NB: The content of this transcript, including the characters 
          and the story, belongs to Mutant Enemy. This transcript was created 
          based on the broadcast episode.
      
       
      
             
            BUFFYWORLD.COM 
              prefers that you direct link to this transcript rather than post 
              it on your site, but you can post it on your site if you really 
              want, as long as you keep everything intact, this includes the link 
              to buffyworld.com and this writing. Please also keep the disclaimers 
              intact.
            
            Originally transcribed for: http://www.buffyworld.com/.
	  
    TEASER (RECAP SEGMENT):
  GILES (V.O.)

  Previousl...

What I'm trying now is to split at each character's name (I have a full list), for example 'GILES' above. This works, except that the character name is dropped when I split on it. Here's a simplified example.

to_parse <- paste(c('BUFFY', 'WILLOW'), collapse = '|')
all <- strsplit(all, to_parse)

This gives me the splits I want, but doesn't retain the character name.

Finite question: Is there a way to retain the character name with what I'm doing? Infinite question: Are there other approaches I should be trying?

Thanks in advance!

Mike H. · Accepted Answer

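I think you can use Perl-compatible regular expressions with strsplit (perl = TRUE) and split on a positive look-behind, which matches the position just after a name without consuming the name itself. For explanatory purposes I used a shorter sample string, but it should work the same way: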

string <- "text BUFFY more text WILLOW other text"

to_parse <- paste(c('BUFFY', 'WILLOW'), collapse = '|')
strsplit(string, paste0("(?<=", to_parse, ")"), perl = TRUE)

#[[1]]
#[1] "text BUFFY"        " more text WILLOW" " other text"

As suggested by @Lamia, if the name comes before the text (as it does in your transcript), you can use a positive look-ahead instead. I edited the suggestion slightly, wrapping the look-ahead in a look-behind for one character, so that each piece keeps its delimiter at the front.

strsplit(string, paste0("(?<=.(?=", to_parse, "))"), perl = TRUE)

#[[1]]
#[1] "text "             "BUFFY more text "  "WILLOW other text"
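
To get from these pieces to the two-column data frame asked for in the question, something along these lines might work. It is only a sketch on the short sample string: speakers, pieces and script_df are names made up for illustration, and it assumes each name in the list shows up only as a speaker label, never inside the dialogue itself.

```r
# Hypothetical stand-ins for the real transcript text and the full speaker list
string <- "text BUFFY more text WILLOW other text"
speakers <- c("BUFFY", "WILLOW")
to_parse <- paste(speakers, collapse = "|")

# Split just before each speaker name (same look-ahead pattern as above)
pieces <- strsplit(string, paste0("(?<=.(?=", to_parse, "))"), perl = TRUE)[[1]]

# Drop any leading piece that does not start with a known speaker name
name_at_start <- paste0("^(", to_parse, ")")
pieces <- pieces[grepl(name_at_start, pieces)]

script_df <- data.frame(
  # The speaker is the name matched at the start of each piece
  speaker = regmatches(pieces, regexpr(name_at_start, pieces)),
  # The dialogue is whatever follows the name, with outer whitespace trimmed
  dialogue = trimws(sub(name_at_start, "", pieces)),
  stringsAsFactors = FALSE
)

print(script_df)
```

On the real transcript you would probably still need to clean up things like "(V.O.)" and stage directions, and if one name is a prefix of another, the longer one likely needs to come first in the alternation.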
