Reputation: 21410
I have multi-party conversations in strings like this:
convers <- "Peter: Hiya Mary: Hi. How w'z your weekend. Peter: a::hh still got a headache. An' you (.) party a lot? Mary: nuh, you know my kid's sick 'n stuff Peter: yeah i know that's=erm al hamshi: hey guys how's it goin'? Peter: Great! Mary: where've you BEn last week al hamshi: ah y' know, camping with my girl friend."
I also have a vector with the speakers' names:
speakers <- c("Peter", "Mary", "al hamshi")
I'd like to create a dataframe with the utterances of each individual speaker in a separate column. So far I can only do this piecemeal, by addressing each speaker through its index in speakers and then combining the separate results in a list. What I'd really like to have is a dataframe with a separate column for each speaker:
library(stringr)  # provides str_extract_all
Peter <- str_extract_all(convers, paste0("(?<=", speakers[1],":\\s).*?(?=\\s*(?:", paste(speakers, collapse="|"),"):|\\z)"))
Mary <- str_extract_all(convers, paste0("(?<=", speakers[2],":\\s).*?(?=\\s*(?:", paste(speakers, collapse="|"),"):|\\z)"))
al_hamshi <- str_extract_all(convers, paste0("(?<=", speakers[3],":\\s).*?(?=\\s*(?:", paste(speakers, collapse="|"),"):|\\z)"))
df <- list(
Peter = Peter, Mary = Mary , al_hamshi = al_hamshi
)
df
$Peter
$Peter[[1]]
[1] "Hiya" "a::hh still got a headache. An' you (.) party a lot?"
[3] "yeah i know that's=erm" "Great!"
$Mary
$Mary[[1]]
[1] "Hi. How w'z your weekend." "nuh, you know my kid's sick 'n stuff" "where've you BEn last week"
$al_hamshi
$al_hamshi[[1]]
[1] "hey guys how's it goin'?" "ah y' know, camping with my girl friend."
How can I extract the same-speaker utterances not one by one but in one go and how can the results be assigned not to a list but a dataframe?
Upvotes: 2
Views: 143
Reputation: 39697
You can append :\\s to each entry of speakers, as you are already doing, and then use gregexpr to find the positions where a speaker starts. Extract these matches with regmatches and remove the previously added :\\s to get the speaker names. Call regmatches again, this time with invert = TRUE, to get the utterances between the matches. split then groups the utterances by speaker. To turn this into the desired data.frame, all speakers need the same number of rows, so the shorter vectors are padded with NA, done here with "[" inside lapply:
x <- gregexpr(paste0(speakers, ":\\s", collapse="|"), convers)
y <- sub(":\\s$", "", regmatches(convers, x)[[1]])
z <- trimws(regmatches(convers, x, TRUE)[[1]][-1])
tt <- split(z, y)
do.call(data.frame, lapply(tt, "[", seq_len(max(lengths(tt)))))
# al.hamshi Mary Peter
#1 hey guys how's it goin'? Hi. How w'z your weekend. Hiya
#2 ah y' know, camping with my girl friend. nuh, you know my kid's sick 'n stuff a::hh still got a headache. An' you (.) party a lot?
#3 <NA> where've you BEn last week yeah i know that's=erm
#4 <NA> <NA> Great!
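To make the intermediate steps easier to follow, here is a minimal sketch of the same approach on a toy string. The input txt, the vector who and the other variable names are made up for illustration and are not part of the question:

```r
# Toy input (hypothetical, not the data from the question)
txt <- "A: one B: two A: three"
who <- c("A", "B")

# Positions of every "speaker: " marker in the string
m <- gregexpr(paste0(who, ":\\s", collapse = "|"), txt)

# The matched markers themselves, with the trailing ": " stripped
spk <- sub(":\\s$", "", regmatches(txt, m)[[1]])

# invert = TRUE returns the text between the matches; the first piece is
# the empty text before the first marker, so it is dropped
utt <- trimws(regmatches(txt, m, invert = TRUE)[[1]][-1])

# Group the utterances by speaker
grouped <- split(utt, spk)

# Indexing past the end of a vector yields NA, which pads the shorter ones
n <- max(lengths(grouped))
padded <- lapply(grouped, "[", seq_len(n))

res <- data.frame(padded, stringsAsFactors = FALSE)
print(res)
```

Here spk lines up one-to-one with utt, which is what lets split do the grouping. Note that data.frame replaces spaces in column names with dots by default, which is why the column above is called al.hamshi.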
Upvotes: 2
Reputation: 34556
With a bit of pre-processing, and assuming the names exactly match the speakers in the conversation text, you can do:
# Pattern to use to insert new lines in string
pattern <- paste0("(", paste0(speakers, ":", collapse = "|"), ")")
# Split string by newlines
split_conv <- strsplit(gsub(pattern, "\n\\1", convers), "\n")[[1]][-1]
# Capture speaker and text into data frame
dat <- strcapture("(.*?):(.*)", split_conv, data.frame(speaker = character(), text = character()))
Which gives:
speaker text
1 Peter Hiya
2 Mary Hi. How w'z your weekend.
3 Peter a::hh still got a headache. An' you (.) party a lot?
4 Mary nuh, you know my kid's sick 'n stuff
5 Peter yeah i know that's=erm
6 al hamshi hey guys how's it goin'?
7 Peter Great!
8 Mary where've you BEn last week
9 al hamshi ah y' know, camping with my girl friend.
To get each speaker into their own column:
# Count lines by speaker
dat$cnt <- with(dat, ave(speaker, speaker, FUN = seq_along))
# Reshape and rename
dat <- reshape(dat, idvar = "cnt", timevar = "speaker", direction = "wide")
names(dat) <- sub("text\\.", "", names(dat))
cnt Peter Mary al hamshi
1 1 Hiya Hi. How w'z your weekend. hey guys how's it goin'?
3 2 a::hh still got a headache. An' you (.) party a lot? nuh, you know my kid's sick 'n stuff ah y' know, camping with my girl friend.
5 3 yeah i know that's=erm where've you BEn last week <NA>
7 4 Great! <NA> <NA>
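The per-speaker counter is what lets reshape line the utterances up in rows. A small sketch of how ave builds it, using a made-up speaker vector; unlike the code above it passes seq_along(speaker) as the first argument, which is my variation so that the counter comes back as integer rather than character:

```r
# Hypothetical speaker sequence, for illustration only
speaker <- c("Peter", "Mary", "Peter", "al hamshi", "Peter")

# Within each speaker group, number the turns 1, 2, 3, ...
cnt <- ave(seq_along(speaker), speaker, FUN = seq_along)

print(data.frame(speaker, cnt))
```

Each speaker's n-th utterance gets cnt equal to n, so rows with the same cnt end up side by side after reshaping to wide format.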
If newlines already exist in your text, choose another character that doesn't occur in it to use as the delimiter for splitting the string.
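As a rough sketch of that variation, the following uses the control character "\001" as the delimiter, on the assumption that it never occurs in the text (the stopifnot checks this). The input is made up, and the speaker/text split is done with regexpr and substr instead of strcapture so that it doesn't depend on how the regex engine treats newlines:

```r
# Hypothetical input that already contains a newline inside an utterance
convers  <- "Peter: Hiya\nthere Mary: Hi. Peter: Great!"
speakers <- c("Peter", "Mary")

# Assumed-unused delimiter; stop early if the text does contain it
delim <- "\001"
stopifnot(!grepl(delim, convers, fixed = TRUE))

# Insert the delimiter in front of every "speaker:" marker, then split on it
pattern <- paste0("(", paste0(speakers, ":", collapse = "|"), ")")
marked  <- gsub(pattern, paste0(delim, "\\1"), convers)
pieces  <- strsplit(marked, delim, fixed = TRUE)[[1]][-1]

# Split each piece at its first colon into speaker and text
pos <- regexpr(":", pieces, fixed = TRUE)
dat <- data.frame(
  speaker = substr(pieces, 1, pos - 1),
  text    = trimws(substr(pieces, pos + 1, nchar(pieces))),
  stringsAsFactors = FALSE
)
print(dat)
```

The newline inside Peter's first utterance is kept as part of the text, since only the delimiter is used for splitting.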
Upvotes: 3