I have several rows of chat data that contain transcripts which look like this:
"Participant 1 (Me): I don't know the answer to this. Participant 2: What do you think? Maybe 20%? Participant 2: I don't know either. Participant 1 (Me): that was my guess Participant 2: ok, let's go for it! ...what do you think? Participant 1 (Me): sure! Participant 2: ok! Participant 2: aww! sorry!"
I would like to extract only the text that follows the prefix Participant 1 (Me): up until the next occurrence of either Participant 1 or Participant 2. All the text that follows Participant 1 (Me): up to those delimiters should be stored in a variable called participant_1_text. I'd like to store all the remaining text in a separate variable called participant_2_text, like so:
participant_1_text = "I don't know the answer to this. that was my guess sure!"
participant_2_text = "What do you think? Maybe 20%? I don't know either. ok, let's go for it! ...what do you think? ok! aww! sorry!"
So all of Participant 1's text and all of Participant 2's text are now separated.
I tried something like the following regex:
(?<=Participant 1)(.*)(?=Participant 2)
But because .* is greedy, that matches everything between the first occurrence of Participant 1 and the last occurrence of Participant 2, instead of matching each turn separately.
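To illustrate the difference, here is a minimal sketch in base R on a shortened transcript (the string s and the helper turns() are made up for this example, not part of the original data). It assumes every turn starts with a label of the form "Participant <digit>" and that the labels never appear inside a message. A lazy quantifier (.*?) combined with a lookahead for the next label stops at the end of each turn:

```r
# Shortened sample transcript (illustrative; same layout as the real data)
s <- "Participant 1 (Me): I don't know. Participant 2: Maybe 20%? Participant 1 (Me): sure! Participant 2: ok!"

# Greedy: a single match running from the first "Participant 1"
# to the last "Participant 2"
greedy <- regmatches(s, gregexpr("(?<=Participant 1)(.*)(?=Participant 2)", s, perl = TRUE))[[1]]

# Lazy: one match per turn, ending at the next speaker label
# or at the end of the string (hypothetical helper)
turns <- function(label, x) {
  pattern <- paste0("(?<=", label, ": ).*?(?=\\s*Participant \\d|$)")
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

participant_1_text <- paste(turns("Participant 1 \\(Me\\)", s), collapse = " ")
participant_2_text <- paste(turns("Participant 2", s), collapse = " ")

length(greedy)        # 1
participant_1_text    # "I don't know. sure!"
participant_2_text    # "Maybe 20%? ok!"
```

The lookbehind has to be fixed-length for this to work with perl = TRUE, which is why the full label (including the colon and space) is spelled out per speaker.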
Edit: I'm now trying to take the code from the answers below and apply it to a data frame containing lots of chat transcripts.
Taking @akrun's code, I made a function that splits a given chat log into my chat and my partner's chat and returns a named list:
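library(dplyr)
library(stringr)
library(tidyr)

extract_chat <- function(chat_text) {
  # start from a one-column tibble (piping chat_text into tibble() added a redundant column)
  final_output <- tibble(col1 = chat_text) %>%
    # put each turn on its own line, then split into one row per turn
    mutate(col1 = str_replace_all(col1, "Participant", "\nParticipant")) %>%
    separate_rows(col1, sep = "\n") %>%
    filter(nzchar(col1)) %>% # drop the empty string before the first turn
    # split at the first colon only, so colons inside a message are kept
    separate(col1, into = c("Participant", "text"), sep = ":", extra = "merge") %>%
    group_by(Participant) %>%
    summarise(text = str_c(text, collapse = " ")) %>%
    # fixed() matches the literal "(Me)" rather than treating it as a regex group
    mutate(Participant = ifelse(str_detect(Participant, fixed("(Me)")),
                                "my_chat_extracted", "partner_chat_extracted")) %>%
    spread(Participant, text)
  list(my_chat_extracted = final_output$my_chat_extracted,
       partner_chat_extracted = final_output$partner_chat_extracted)
}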
This seems to work fine, but I'm not sure how to mutate the actual columns in my data frame using this function.
Here's an example data frame to use:
str1 <- "Participant 1 (Me): I don't know the answer to this. Participant 2: What do you think? Maybe 20%? Participant 2: I don't know either. Participant 1 (Me): that was my guess Participant 2: ok, let's go for it! ...what do you think? Participant 1 (Me): sure! Participant 2: ok! Participant 2: aww! sorry!"
str2 <- "Participant 1 (Me): Hey, how are you? Participant 2: I'm good, how about you? Participant 2: I'm excited. Participant 1 (Me): I'm also good."
test = data.frame(chat = c(str1, str2))
I want to do something like:
tester = test %>%
  rowwise() %>%
  mutate(my_chat_extracted = extract_chat(chat)$my_chat_extracted)
But this seems to be pretty slow on my actual dataset, and feels sloppy.
Upvotes: 3
Here's another method using stringr:
library(stringr)
txt <- "Participant 1 (Me): I don't know the answer to this. Participant 2: What do you think? Maybe 20%? Participant 2: I don't know either. Participant 1 (Me): that was my guess Participant 2: ok, let's go for it! ...what do you think? Participant 1 (Me): sure! Participant 2: ok! Participant 2: aww! sorry!"
txt %>%
  str_split("(?=Participant.+:)", simplify = TRUE) %>%
  str_split(": ", simplify = TRUE) %>%
  .[-1, ]
#> [,1] [,2]
#> [1,] "Participant 1 (Me)" "I don't know the answer to this. "
#> [2,] "Participant 2" "What do you think? Maybe 20%? "
#> [3,] "Participant 2" "I don't know either. "
#> [4,] "Participant 1 (Me)" "that was my guess "
#> [5,] "Participant 2" "ok, let's go for it! ...what do you think? "
#> [6,] "Participant 1 (Me)" "sure! "
#> [7,] "Participant 2" "ok! "
#> [8,] "Participant 2" "aww! sorry!"
Created on 2020-06-17 by the reprex package (v0.3.0)
Upvotes: 3
Another way to do this using stringr, where s is the given string:
r <- "Participant \\d( \\(Me\\))?: "
cbind(unlist(stringr::str_extract_all(s, r)), strsplit(s, r)[[1]][-1])
#> [,1] [,2]
#> [1,] "Participant 1 (Me): " "I don't know the answer to this. "
#> [2,] "Participant 2: " "What do you think? Maybe 20%? "
#> [3,] "Participant 2: " "I don't know either. "
#> [4,] "Participant 1 (Me): " "that was my guess "
#> [5,] "Participant 2: " "ok, let's go for it! ...what do you think? "
#> [6,] "Participant 1 (Me): " "sure! "
#> [7,] "Participant 2: " "ok! "
#> [8,] "Participant 2: " "aww! sorry!"
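To get from this two-column result to the two strings the question asks for, one option is to select the turn texts by their speaker label. The sketch below uses base R equivalents of the calls above (regmatches/gregexpr in place of str_extract_all) on a shortened transcript; the sample string and the names labels, texts and is_me are made up for illustration:

```r
# Shortened sample transcript (illustrative; same layout as the question's data)
s <- "Participant 1 (Me): I don't know. Participant 2: Maybe 20%? Participant 1 (Me): sure! Participant 2: ok!"
r <- "Participant \\d( \\(Me\\))?: "

labels <- regmatches(s, gregexpr(r, s, perl = TRUE))[[1]]  # speaker label of each turn
texts  <- strsplit(s, r, perl = TRUE)[[1]][-1]             # text of each turn; [-1] drops the leading ""
is_me  <- grepl("(Me)", labels, fixed = TRUE)              # TRUE for "Participant 1 (Me): "

# Trim the trailing space of each turn, then join the turns per speaker
participant_1_text <- paste(trimws(texts[is_me]), collapse = " ")
participant_2_text <- paste(trimws(texts[!is_me]), collapse = " ")

participant_1_text  # "I don't know. sure!"
participant_2_text  # "Maybe 20%? ok!"
```

This treats every turn without the (Me) marker as the partner's, which matches the two-speaker transcripts in the question.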
Upvotes: 3
We can insert a newline character before each Participant (with str_replace_all), then split at the \n with separate_rows, filter out any blanks (nzchar), separate the column into two at :, and then, grouped by 'Participant', paste the 'text' strings into a single string:
library(dplyr)
library(stringr)
library(tidyr)
out <- tibble(col1 = str1) %>%
  mutate(col1 = str_replace_all(col1, "Participant", "\nParticipant")) %>%
  separate_rows(col1, sep = "\n") %>%
  filter(nzchar(col1)) %>%
  separate(col1, into = c('Participant', "text"), sep = ":") %>%
  group_by(Participant = str_remove(Participant, "\\s*\\(.*")) %>%
  summarise(text = str_c(text, collapse = ' '))
out
# A tibble: 2 x 2
# Participant text
# <chr> <chr>
#1 Participant 1 " I don't know the answer to this. that was my guess sure! "
#2 Participant 2 " What do you think? Maybe 20%? I don't know either. ok, let's go for it! ...what do you think? ok! aww! sorry!"
It may be better to keep it in a data.frame, but if we need separate objects, use list2env after deframe-ing:
library(tibble)
list2env(as.list(deframe(out)), .GlobalEnv)
`Participant 1`
#[1] " I don't know the answer to this. that was my guess sure! "
Data:
str1 <- "Participant 1 (Me): I don't know the answer to this. Participant 2: What do you think? Maybe 20%? Participant 2: I don't know either. Participant 1 (Me): that was my guess Participant 2: ok, let's go for it! ...what do you think? Participant 1 (Me): sure! Participant 2: ok! Participant 2: aww! sorry!"
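For a data frame with many transcripts, the same steps can probably be run on the whole column at once instead of calling a function row by row. The following is a sketch under a few assumptions: the column is called chat as in the question, an id column (added here for illustration) keeps turns from different transcripts apart, and the sample transcripts are shortened stand-ins for the real data. It uses pivot_wider and str_squish, which are not in the code above.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Two short transcripts (illustrative; same layout as the question's data)
test <- data.frame(chat = c(
  "Participant 1 (Me): sure! Participant 2: ok! Participant 2: aww! sorry!",
  "Participant 1 (Me): Hey, how are you? Participant 2: I'm good, how about you? Participant 2: I'm excited. Participant 1 (Me): I'm also good."
), stringsAsFactors = FALSE)

tester <- test %>%
  mutate(id = row_number(),  # identifies the transcript each turn came from
         col1 = str_replace_all(chat, "Participant", "\nParticipant")) %>%
  separate_rows(col1, sep = "\n") %>%          # one row per turn
  filter(nzchar(col1)) %>%                     # drop the empty piece before the first turn
  separate(col1, into = c("Participant", "text"), sep = ":", extra = "merge") %>%
  mutate(speaker = if_else(str_detect(Participant, fixed("(Me)")),
                           "my_chat_extracted", "partner_chat_extracted")) %>%
  group_by(id, chat, speaker) %>%
  # join each speaker's turns per transcript and tidy up the whitespace
  summarise(text = str_squish(str_c(text, collapse = " ")), .groups = "drop") %>%
  pivot_wider(names_from = speaker, values_from = text)

tester$my_chat_extracted
# "sure!"  "Hey, how are you? I'm also good."
```

The result has one row per transcript with the original chat column plus my_chat_extracted and partner_chat_extracted. Whether this is faster than the rowwise() version on a given dataset would need to be measured.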
Upvotes: 3