Chris Ruehlemann

Reputation: 21410

Split strings into utterances and assign same-speaker utterances to columns in dataframe

I have multi-party conversations in strings like this:

convers <- "Peter: Hiya Mary: Hi. How w'z your weekend. Peter: a::hh still got a headache. An' you (.) party a lot? Mary: nuh, you know my kid's sick 'n stuff Peter: yeah i know that's=erm al hamshi: hey guys how's it goin'? Peter: Great! Mary: where've you BEn last week al hamshi: ah y' know, camping with my girl friend."

I also have a vector with the speakers' names:

speakers <- c("Peter", "Mary", "al hamshi")

I'd like to create a dataframe with each speaker's utterances in a separate column. So far I can only do this piecemeal: I address each speaker individually via its index in speakers and then combine the separate results in a list. What I'd really like is a dataframe with one column per speaker:

library(stringr)

Peter <- str_extract_all(convers, paste0("(?<=", speakers[1],":\\s).*?(?=\\s*(?:", paste(speakers, collapse="|"),"):|\\z)"))
Mary <- str_extract_all(convers, paste0("(?<=", speakers[2],":\\s).*?(?=\\s*(?:", paste(speakers, collapse="|"),"):|\\z)"))
al_hamshi <- str_extract_all(convers, paste0("(?<=", speakers[3],":\\s).*?(?=\\s*(?:", paste(speakers, collapse="|"),"):|\\z)"))

df <- list(
  Peter = Peter, Mary = Mary , al_hamshi = al_hamshi
)
df
$Peter
$Peter[[1]]
[1] "Hiya"                                                 "a::hh still got a headache. An' you (.) party a lot?"
[3] "yeah i know that's=erm"                               "Great!"                                              


$Mary
$Mary[[1]]
[1] "Hi. How w'z your weekend."            "nuh, you know my kid's sick 'n stuff" "where've you BEn last week"          


$al_hamshi
$al_hamshi[[1]]
[1] "hey guys how's it goin'?"                 "ah y' know, camping with my girl friend."

How can I extract the same-speaker utterances in one go rather than one by one, and how can the results be assigned to a dataframe rather than a list?

Upvotes: 2

Views: 143

Answers (2)

GKi

Reputation: 39697

You can append :\\s to each speaker name, as you are already doing, and use gregexpr to find the positions where a speaker starts. Extract these matches with regmatches and remove the previously added :\\s to get the speaker. Call regmatches again, this time with invert = TRUE, to get the utterances. With split the utterances are grouped by speaker. To turn this into the desired data.frame, the shorter vectors have to be padded with NA so that all speakers have the same length, done here with [ inside lapply:

x <- gregexpr(paste0(speakers, ":\\s", collapse="|"), convers)
y <- sub(":\\s$", "", regmatches(convers, x)[[1]])
z <- trimws(regmatches(convers, x, TRUE)[[1]][-1])
tt <- split(z, y)
do.call(data.frame, lapply(tt, "[", seq_len(max(lengths(tt)))))
#                                 al.hamshi                                 Mary                                                Peter
#1                 hey guys how's it goin'?            Hi. How w'z your weekend.                                                 Hiya
#2 ah y' know, camping with my girl friend. nuh, you know my kid's sick 'n stuff a::hh still got a headache. An' you (.) party a lot?
#3                                     <NA>           where've you BEn last week                               yeah i know that's=erm
#4                                     <NA>                                 <NA>                                               Great!
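As the output shows, data.frame rewrites the name "al hamshi" to "al.hamshi", and split orders the groups alphabetically rather than in the order of speakers. A minimal sketch of one way to keep both the original names and the original order, assuming base R only; the helper name utterances_wide and the shortened demo strings are made up for illustration and are not part of the answer:

```r
# Hypothetical helper wrapping the steps above (name is an assumption).
utterances_wide <- function(convers, speakers) {
  # Positions of every "<speaker>: " marker
  m <- gregexpr(paste0(speakers, ":\\s", collapse = "|"), convers)
  # Who speaks at each marker
  who <- sub(":\\s$", "", regmatches(convers, m)[[1]])
  # Text between markers; the first piece precedes the first speaker
  what <- trimws(regmatches(convers, m, invert = TRUE)[[1]][-1])
  # Fixing the factor levels keeps the column order of `speakers`
  by_speaker <- split(what, factor(who, levels = speakers))
  # Pad the shorter vectors with NA so all columns have equal length
  n <- max(lengths(by_speaker))
  padded <- lapply(by_speaker, `[`, seq_len(n))
  # check.names = FALSE keeps names such as "al hamshi" untouched
  do.call(data.frame,
          c(padded, list(check.names = FALSE, stringsAsFactors = FALSE)))
}

# Shortened demo input, not the original conversation
conv_demo <- "Peter: Hiya Mary: Hi. al hamshi: hey guys Peter: Great!"
spk_demo <- c("Peter", "Mary", "al hamshi")
res <- utterances_wide(conv_demo, spk_demo)
res
```

Columns whose names contain spaces then have to be accessed with res[["al hamshi"]] or backticks.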

Upvotes: 2

lroha

Reputation: 34556

With a bit of pre-processing, and assuming the names exactly match the speakers in the conversation text, you can do:

# Pattern to use to insert new lines in string
pattern <- paste0("(", paste0(speakers, ":", collapse = "|"), ")")

# Split string by newlines
split_conv <- strsplit(gsub(pattern, "\n\\1", convers), "\n")[[1]][-1]

# Capture speaker and text into data frame
dat <- strcapture("(.*?):(.*)", split_conv, data.frame(speaker = character(), text = character()))

Which gives:

    speaker                                                   text
1     Peter                                                  Hiya 
2      Mary                             Hi. How w'z your weekend. 
3     Peter  a::hh still got a headache. An' you (.) party a lot? 
4      Mary                  nuh, you know my kid's sick 'n stuff 
5     Peter                                yeah i know that's=erm 
6 al hamshi                              hey guys how's it goin'? 
7     Peter                                                Great! 
8      Mary                            where've you BEn last week 
9 al hamshi               ah y' know, camping with my girl friend.

To get each speaker into their own column:

# Count lines by speaker
dat$cnt <- with(dat, ave(speaker, speaker, FUN = seq_along))

# Reshape and rename
dat <- reshape(dat, idvar = "cnt", timevar = "speaker", direction = "wide")
names(dat) <- sub("text\\.", "", names(dat))

  cnt                                                  Peter                                   Mary                                 al hamshi
1   1                                                  Hiya              Hi. How w'z your weekend.                  hey guys how's it goin'? 
3   2  a::hh still got a headache. An' you (.) party a lot?   nuh, you know my kid's sick 'n stuff   ah y' know, camping with my girl friend.
5   3                                yeah i know that's=erm             where've you BEn last week                                       <NA>
7   4                                                Great!                                    <NA>                                      <NA>

If newlines already exist in your text, choose another character that doesn't occur in it and use that to split the string.
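A rough sketch of that variant, assuming the ASCII unit separator "\x1f" is a safe choice for your data; the names sep, conv_nl, spk_nl and split_demo are made up for this example, and the demo string is not the original conversation:

```r
# Demo text that already contains a newline inside an utterance
conv_nl <- "Peter: Hiya\nthere Mary: Hi."
spk_nl <- c("Peter", "Mary")

# Candidate separator: ASCII unit separator (an assumption, pick what suits your data)
sep <- "\x1f"
# Stop early if the separator already occurs in the text
stopifnot(!grepl(sep, conv_nl, fixed = TRUE))

# Same pattern as above, but insert `sep` instead of "\n" before each speaker
pattern <- paste0("(", paste0(spk_nl, ":", collapse = "|"), ")")
marked <- gsub(pattern, paste0(sep, "\\1"), conv_nl)

# Split on the separator literally; drop the empty piece before the first speaker
split_demo <- strsplit(marked, sep, fixed = TRUE)[[1]][-1]
split_demo
```

The existing newline stays inside Peter's utterance instead of producing an extra split.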

Upvotes: 3
