Reputation: 627
I have a record of conversations between two arbitrary persons A and B.
c1 <- "Person A: blabla...something Person B: blabla something else Person A: OK blabla"
c2 <- "Person A: again blabla Person B: blabla something else Person A: thanks blabla"
The data frame looks like this:
df <- data.frame(id = rbind(123, 345), conversation = rbind(c1, c2))
df
id conversation
c1 123 Person A: blabla...something Person B: blabla something else Person A: OK blabla
c2 345 Person A: again blabla Person B: blabla something else Person A: thanks blabla
Now I would like to extract only the part of person A and put it in a data frame. The result should be:
id person_A
1 123 blabla...something OK blabla
2 345 again blabla thanks blabla
Upvotes: 8
Views: 345
Reputation: 109874
I'm a big fan of solving this sort of problem in a way that gives you access to all the data (that includes Person B's discourse as well). I love tidyr's extract for this sort of column splitting. I used to use a do.call(rbind, strsplit()) approach but love how clean the extract approach is.
c1 <- "Person A: blabla...something Person B: blabla something else Person A: OK blabla"
c2 <- "Person A: again blabla Person B: blabla something else Person A: thanks blabla"
c3 <- "Person A: again blabla Person B: blabla something else"
df <- data.frame(id = rbind(123, 345, 567), conversation = rbind(c1, c2, c3))
if (!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr, tidyr)
conv <- strsplit(as.character(df[["conversation"]]), "\\s+(?=Person\\s)", perl=TRUE)
df2 <- df[rep(1:nrow(df), sapply(conv, length)), ,drop=FALSE]
rownames(df2) <- NULL
df2[["conversation"]] <- unlist(conv)
df2 %>%
extract(conversation, c("Person", "Conversation"), "([^:]+):\\s+(.+)")
## id Person Conversation
## 1 123 Person A blabla...something
## 2 123 Person B blabla something else
## 3 123 Person A OK blabla
## 4 345 Person A again blabla
## 5 345 Person B blabla something else
## 6 345 Person A thanks blabla
## 7 567 Person A again blabla
## 8 567 Person B blabla something else
df2 %>%
extract(conversation, c("Person", "Conversation"), "([^:]+):\\s+(.+)") %>%
filter(Person == "Person A")
## id Person Conversation
## 1 123 Person A blabla...something
## 2 123 Person A OK blabla
## 3 345 Person A again blabla
## 4 345 Person A thanks blabla
## 5 567 Person A again blabla
Or collapse them as you show in the desired output:
df2 %>%
extract(conversation, c("Person", "Conversation"), "([^:]+):\\s+(.+)") %>%
filter(Person == "Person A") %>%
group_by(id) %>%
select(-Person) %>%
summarise(Person_A =paste(Conversation, collapse=" "))
## id Person_A
## 1 123 blabla...something OK blabla
## 2 345 again blabla thanks blabla
## 3 567 again blabla
Edit: In reality I suspect your data has real names like "John Smith" rather than "Person A". If that is the case, this initial regex split will capture a capitalized first and last name followed by a colon:
c1 <- "Greg Smith: blabla...something Sue Williams: blabla something else Greg Smith: OK blabla"
c2 <- "Greg Smith: again blabla Sue Williams: blabla something else Greg Smith: thanks blabla"
c3 <- "Greg Smith: again blabla Sue Williams: blabla something else"
df <- data.frame(id = rbind(123, 345, 567), conversation = rbind(c1, c2, c3))
conv <- strsplit(as.character(df[["conversation"]]), "\\s+(?=([A-Z][a-z]+\\s+[A-Z][a-z]+:))", perl=TRUE)
df2 <- df[rep(1:nrow(df), sapply(conv, length)), ,drop=FALSE]
rownames(df2) <- NULL
df2[["conversation"]] <- unlist(conv)
df2 %>%
extract(conversation, c("Person", "Conversation"), "([^:]+):\\s+(.+)")
## id Person Conversation
## 1 123 Greg Smith blabla...something
## 2 123 Sue Williams blabla something else
## 3 123 Greg Smith OK blabla
## 4 345 Greg Smith again blabla
## 5 345 Sue Williams blabla something else
## 6 345 Greg Smith thanks blabla
## 7 567 Greg Smith again blabla
## 8 567 Sue Williams blabla something else
Upvotes: 4
Reputation: 118809
Using data.table and gsub from base R:
require(data.table)
setDT(df)[, Person_A := gsub(".*Person A:[ ]*(.*)[ ]*Person B.*:[ ]*(.*)$",
"\\1\\2", conversation)][, conversation := NULL]
df
# id Person_A
# 1: 123 blabla...something OK blabla
# 2: 345 again blabla thanks blabla
Upvotes: 1
Reputation: 7119
This is my attempt. I have also added a conversation started by Person B and a conversation ended by Person B, to cover those cases as well:
c1 <- "Person A: blabla...something Person B: blabla something else Person A: OK blabla"
c2 <- "Person A: again blabla Person B: blabla something else Person A: thanks blabla"
c3 <- "Person A: again blabla Person B: blabla something else"
df <- data.frame(id = rbind(123, 345, 567), conversation = rbind(c1, c2, c3))
df$PersonA <- gsub("(Person A: |Person B: .+? (?<= Person A: )|Person B: .+?\\Z)", "", df$conversation, perl = TRUE)
df$PersonA
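What I'm doing with gsub is removing three things: each "Person A: " label, each Person B turn up to and including the following "Person A: " label, and a final Person B turn that runs to the end of the string (\Z).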
I used perl = TRUE because life is too short not to use the rearview mirror... ehm... the lookbehind operator.
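To see what each alternative in the pattern removes, here is a minimal sketch in base R. The helper name strip_b and the extra string c4 (a conversation that Person B starts) are assumptions added for illustration; they are not part of the original answer.

```r
# Pattern copied from the answer above: gsub() deletes whatever any alternative matches.
pattern <- "(Person A: |Person B: .+? (?<= Person A: )|Person B: .+?\\Z)"

# strip_b is a hypothetical helper name introduced for this sketch.
strip_b <- function(x) gsub(pattern, "", x, perl = TRUE)

c1 <- "Person A: blabla...something Person B: blabla something else Person A: OK blabla"
c3 <- "Person A: again blabla Person B: blabla something else"
# c4 is an assumed extra case: Person B speaks first and last.
c4 <- "Person B: hello there Person A: hi Person B: bye"

# A trailing space is left behind when Person B has the last word, so trim it.
res <- trimws(strip_b(c(c1, c3, c4)))
print(res)
# [1] "blabla...something OK blabla" "again blabla"                 "hi"
```

The trimws() call is only there to tidy the leftover space; the pattern itself is unchanged.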
Upvotes: 0
Reputation: 51650
Using the stringr package
First we split the string using "Person A: " as a delimiter:
library(stringr)
conv.split <- str_split(df$conversation, "Person A: ")
This gives us every piece of conversation started by A, with B's (optional) answer attached.
We now remove B's answers:
conv.split <- lapply(conv.split, function(x){str_split(x, "Person B:.*")})
And finally we unlist each element and collapse it together into a string
sapply(conv.split, function(x){x <- unlist(x); paste(x, collapse = "")})
Result:
[1] "blabla...something OK blabla" "again blabla thanks blabla"
This also works when B starts the conversation, when only one of the two is speaking, and for long conversations.
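A quick way to check those edge cases is to wrap the three steps in a function and feed it a few made-up conversations. This is only a sketch: the wrapper name extract_a and the test strings are assumptions for illustration, and it requires the stringr package to be installed.

```r
library(stringr)

# extract_a is a hypothetical wrapper; its body repeats the three steps shown above.
extract_a <- function(conversation) {
  pieces <- str_split(conversation, "Person A: ")                    # split on A's label
  pieces <- lapply(pieces, function(x) str_split(x, "Person B:.*"))  # cut off B's answers
  sapply(pieces, function(x) paste(unlist(x), collapse = ""))        # glue A's parts together
}

tests <- c(
  "Person B: hi Person A: hello Person B: bye",  # B starts and ends the conversation
  "Person A: just me talking",                   # only A speaks
  "Person B: just me talking"                    # only B speaks
)

out <- extract_a(tests)
print(out)
# [1] "hello "          "just me talking" ""
```

Note the trailing space after "hello": the space that preceded B's label is kept, so trimws() may be useful on the result.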
Upvotes: 2
Reputation: 5951
It might not work for all your cases, especially ones where the conversation is started by Person B. Let me know if that is the case. Otherwise, try:
df$person_A <- gsub("Person B.*:|Person A:", "", df$conversation)
df <- data.frame(id = df$id, person_A = df$person_A)  # name the columns explicitly
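As a rough check of where this pattern can break down, the sketch below runs it on two made-up conversations. The helper name strip_labels and both test strings are assumptions for illustration; the pattern is the one from the line above.

```r
# strip_labels is a hypothetical helper wrapping the gsub() call from this answer.
strip_labels <- function(x) gsub("Person B.*:|Person A:", "", x)

# Assumed test inputs, not taken from the question.
ends_with_b <- "Person A: again blabla Person B: blabla something else"
two_b_turns <- "Person A: one Person B: two Person A: three Person B: four Person A: five"

r1 <- strip_labels(ends_with_b)
r2 <- strip_labels(two_b_turns)

# When B has the last word, only the label is removed, so B's text stays in the result.
grepl("something else", r1)  # TRUE

# The greedy .* runs from the first "Person B" to the last colon,
# so A's middle turn ("three") is removed together with B's turns.
grepl("three", r2)  # FALSE
```

So the one-liner seems best suited to conversations of the exact A, B, A shape shown in the question.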
Upvotes: 0