Reputation: 97
I have a data frame with two columns of delimited strings:
df <- data.frame('a'=c('a, b, c, d', 'a, c', 'b, d'), 'b'=c('a, d', 'a', 'a, d'))
a b
1 a, b, c, d a, d
2 a, c a
3 b, d a, d
And I would like to create a third column to include the strings that intersect the first two columns, for example:
a b c
1 a, b, c, d a, d a, d
2 a, c a a
3 b, d a, d d
I have tried a number of approaches that involve converting the strings to lists and back but I don't seem to be able to get it right.
Using dplyr
I first attempted to use:
df <- df %>%
mutate(c=paste(c(intersect(unlist(strsplit(a, split=", ")), unlist(strsplit(b, split=", "))))))
Which resulted in an error:
Error in eval(substitute(expr), envir, enclos) : wrong result size (2), expected 3 or 1
As well as not returning the required string, this also seems to return results of the same size for each row (verified by changing the mutate function above from paste to length below):
df %>%
mutate(c=length(c(intersect(unlist(strsplit(a, split=", ")), unlist(strsplit(b, split=", "))))))
a b c
1 a, b, c, d a, d 2
2 a, c a 2
3 b, d a, d 2
Which makes me worry that all my row results are being combined into one result and repeated.
To try to simplify things I attempted to convert my strings into lists before using the intersect function:
df %>% mutate(a_list=list(unlist(strsplit(a, split=", "))))
But received the error:
Error in eval(substitute(expr), envir, enclos) : not compatible with STRSXP
Which makes me wonder whether lists in data frames are compatible with the tidyverse and, as such, whether I need to take an entirely different approach.
Any advice on how to approach the problem of finding strings shared between two data frame columns in R (as well as any insight into how to treat list-like values in data frames) would be gratefully received.
Upvotes: 0
Views: 778
Reputation: 38510
This base R method will work: use strsplit to split the variables into lists in which each element is a character vector. The mapply function takes these lists and applies the given function to the pairs of elements that sit in the same position in each list. Inside that function, use intersect to find the overlapping elements and paste with collapse to join them into a single string.
df$c <- mapply(function(x, y) paste(intersect(x, y), collapse=", "),
strsplit(df$a, ", "), strsplit(df$b, ", "))
df
a b c
1 a, b, c, d a, d a, d
2 a, c a a
3 b, d a, d d
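If you would rather stay inside a dplyr pipeline, the same pairwise call should also work inside mutate, because mapply returns one string per row. A minimal sketch, assuming dplyr is installed and that both columns are character rather than factor; the helper name shared and the result name res are only for illustration:

```r
library(dplyr)

df <- data.frame(a = c("a, b, c, d", "a, c", "b, d"),
                 b = c("a, d", "a", "a, d"),
                 stringsAsFactors = FALSE)

# Illustrative helper: intersect two already-split character vectors
# and join the shared values back into one comma-separated string.
shared <- function(x, y) paste(intersect(x, y), collapse = ", ")

# strsplit() yields one vector per row; mapply() pairs them up row by row.
res <- df %>%
  mutate(c = mapply(shared, strsplit(a, ", "), strsplit(b, ", ")))

res$c
```

The values in res$c should match the c column printed above.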
data
df <- data.frame('a'=c('a, b, c, d', 'a, c', 'b, d'),
'b'=c('a, d', 'a', 'a, d'), stringsAsFactors=FALSE)
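As for why the attempts in the question gave the same value on every row: as far as I can tell, mutate hands each column to the expression as a whole vector, so unlist(strsplit(a, split=", ")) pools the pieces from all rows before intersect ever sees them. A small base R sketch of that effect, plus the split values kept as a list column (the names pooled_a, pooled_b, pooled and a_list are only for illustration):

```r
df <- data.frame(a = c("a, b, c, d", "a, c", "b, d"),
                 b = c("a, d", "a", "a, d"),
                 stringsAsFactors = FALSE)

# unlist() flattens the per-row pieces into one long vector per column
pooled_a <- unlist(strsplit(df$a, ", "))
pooled_b <- unlist(strsplit(df$b, ", "))

# so there is a single intersection for the whole column, not one per row
pooled <- intersect(pooled_a, pooled_b)
length(pooled)

# Keeping the list that strsplit() returns gives one vector per row;
# base R stores it as a list column when it is assigned with $<-
df$a_list <- strsplit(df$a, ", ")
lengths(df$a_list)
```

That single pooled result has two elements, which would explain both the 2 on every row and the "wrong result size (2)" error.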
Upvotes: 1
Reputation: 17648
You can try:
library(stringr)
# go through each row, extract the letters, search for duplicates and paste together
apply(df, 1, function(x){
tmp <- str_trim(unlist(str_split(x,",")))
paste(tmp[duplicated(tmp)],collapse=", ")
})
[1] "a, d" "a" "d"
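One thing to watch: duplicated flags any repeated value, so this seems to rely on no letter appearing twice within the same cell. A base R sketch of the same row-wise idea using intersect instead, which should not have that dependence (the function name shared_in_row and the extra data frame df2 are made up for illustration):

```r
# Row-wise variant: split each cell on commas, then intersect the two sets
shared_in_row <- function(x) {
  parts <- strsplit(x, ",\\s*")  # list with one character vector per column
  paste(intersect(parts[[1]], parts[[2]]), collapse = ", ")
}

# Hypothetical data: the first row repeats "a" inside column a
df2 <- data.frame(a = c("a, a, b", "a, c"),
                  b = c("b", "a"),
                  stringsAsFactors = FALSE)

out <- unname(apply(df2, 1, shared_in_row))
out
```

For the first row of df2 the duplicated version should report "a, b", because the repeated "a" counts as a duplicate even though it never appears in column b.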
Upvotes: 0