Split strings into utterances and assign same-speaker utterances to columns in dataframe

Question

I have multi-party conversations in strings like this:

convers <- "Peter: Hiya Mary: Hi. How w'z your weekend. Peter: a::hh still got a headache. An' you (.) party a lot? Mary: nuh, you know my kid's sick 'n stuff Peter: yeah i know that's=erm al hamshi: hey guys how's it goin'? Peter: Great! Mary: where've you BEn last week al hamshi: ah y' know, camping with my girl friend."

I also have a vector with the speakers' names:

speakers <- c("Peter", "Mary", "al hamshi")

I'd like to create a dataframe with each speaker's utterances in a separate column. So far I can only do this piecemeal, by addressing each speaker through its index in speakers and then combining the separate results in a list:

library(stringr)
Peter <- str_extract_all(convers, paste0("(?<=", speakers[1],":\\s).*?(?=\\s*(?:", paste(speakers, collapse="|"),"):|\\z)"))
Mary <- str_extract_all(convers, paste0("(?<=", speakers[2],":\\s).*?(?=\\s*(?:", paste(speakers, collapse="|"),"):|\\z)"))
al_hamshi <- str_extract_all(convers, paste0("(?<=", speakers[3],":\\s).*?(?=\\s*(?:", paste(speakers, collapse="|"),"):|\\z)"))

df <- list(
  Peter = Peter, Mary = Mary , al_hamshi = al_hamshi
)
df
$Peter
$Peter[[1]]
[1] "Hiya"                                                 "a::hh still got a headache. An' you (.) party a lot?"
[3] "yeah i know that's=erm"                               "Great!"                                              


$Mary
$Mary[[1]]
[1] "Hi. How w'z your weekend."            "nuh, you know my kid's sick 'n stuff" "where've you BEn last week"          


$al_hamshi
$al_hamshi[[1]]
[1] "hey guys how's it goin'?"                 "ah y' know, camping with my girl friend."

How can I extract the same-speaker utterances in one go rather than one by one, and how can the results be assigned to a dataframe rather than a list?

lroha · Accepted Answer

With a bit of pre-processing, and assuming the names exactly match the speakers in the conversation text, you can do:

# Pattern to use to insert new lines in string
pattern <- paste0("(", paste0(speakers, ":", collapse = "|"), ")")

# Split string by newlines
split_conv <- strsplit(gsub(pattern, "\n\\1", convers), "\n")[[1]][-1]

# Capture speaker and text into data frame
dat <- strcapture("(.*?):(.*)", split_conv, data.frame(speaker = character(), text = character()))

Which gives:

    speaker                                                   text
1     Peter                                                  Hiya 
2      Mary                             Hi. How w'z your weekend. 
3     Peter  a::hh still got a headache. An' you (.) party a lot? 
4      Mary                  nuh, you know my kid's sick 'n stuff 
5     Peter                                yeah i know that's=erm 
6 al hamshi                              hey guys how's it goin'? 
7     Peter                                                Great! 
8      Mary                            where've you BEn last week 
9 al hamshi               ah y' know, camping with my girl friend.
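
The pattern is built by pasting the speaker names straight into a regular expression, so a name containing regex metacharacters (such as "." or "+") would not be matched literally. A minimal sketch of one way around this, assuming PCRE's \Q...\E quoting with perl = TRUE and assuming no name contains the sequence \E; the names and text here are hypothetical, not from the question:

```r
# Hypothetical speaker names containing regex metacharacters
speakers_q <- c("Dr. Who", "A+B")
convers_q <- "Dr. Who: hello A+B: hi Dr. Who: bye"

# Quote each name with \Q...\E so it is matched literally (PCRE only)
pattern_q <- paste0("(", paste0("\\Q", speakers_q, "\\E:", collapse = "|"), ")")

# Same newline-insertion trick as above, but with perl = TRUE
split_q <- strsplit(gsub(pattern_q, "\n\\1", convers_q, perl = TRUE), "\n")[[1]][-1]
split_q
```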

To get each speaker into their own column:

# Count lines by speaker
dat$cnt <- with(dat, ave(speaker, speaker, FUN = seq_along))

# Reshape and rename
dat <- reshape(dat, idvar = "cnt", timevar = "speaker", direction = "wide")
names(dat) <- sub("text\\.", "", names(dat))

  cnt                                                  Peter                                   Mary                                 al hamshi
1   1                                                  Hiya              Hi. How w'z your weekend.                  hey guys how's it goin'? 
3   2  a::hh still got a headache. An' you (.) party a lot?   nuh, you know my kid's sick 'n stuff   ah y' know, camping with my girl friend.
5   3                                yeah i know that's=erm             where've you BEn last week                                       
7   4                                                Great!
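
For comparison, the same wide layout can probably also be reached without reshape(). This is a different technique from the one above, shown only as a sketch: split the text by speaker, pad the shorter vectors with NA, and bind them as columns. The small dat_demo data frame and the name wide are made up for illustration:

```r
# Hypothetical long-format data in the same shape as `dat` above
dat_demo <- data.frame(
  speaker = c("Peter", "Mary", "Peter", "al hamshi", "Peter"),
  text = c("Hiya", "Hi.", "a::hh", "hey guys", "Great!"),
  stringsAsFactors = FALSE
)

# One character vector per speaker, in order of first appearance
by_speaker <- split(dat_demo$text,
                    factor(dat_demo$speaker, levels = unique(dat_demo$speaker)))

# Pad every vector with NA up to the longest one
n <- max(lengths(by_speaker))
padded <- lapply(by_speaker, `length<-`, n)

# check.names = FALSE keeps "al hamshi" as a column name with its space
wide <- do.call(data.frame,
                c(padded, list(check.names = FALSE, stringsAsFactors = FALSE)))
wide
```

Speakers with fewer utterances end up with NA in the trailing rows.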

If newlines already exist in your text, choose another character that doesn't occur in it and use that to split the string instead.
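
A minimal sketch of that variation, assuming the ASCII unit-separator character ("\x1f") never occurs in the text. The input and the names ending in _nl are hypothetical. The speaker and text are separated at the first colon with substr() here rather than strcapture(), to avoid relying on how "." treats embedded newlines:

```r
# Assumption: the unit-separator control character never occurs in the text
sep <- "\x1f"

# Hypothetical input whose first utterance already contains a newline
speakers_nl <- c("Peter", "Mary")
convers_nl <- "Peter: Hiya\nthere Mary: Hi. Peter: Great!"

# Insert the separator before each speaker label, then split on it
pattern_nl <- paste0("(", paste0(speakers_nl, ":", collapse = "|"), ")")
marked <- gsub(pattern_nl, paste0(sep, "\\1"), convers_nl)
split_nl <- strsplit(marked, sep, fixed = TRUE)[[1]][-1]

# Cut each piece at its first colon into speaker and text
colon <- as.integer(regexpr(":", split_nl, fixed = TRUE))
dat_nl <- data.frame(
  speaker = substr(split_nl, 1, colon - 1),
  text = trimws(substring(split_nl, colon + 1)),
  stringsAsFactors = FALSE
)
dat_nl
```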
