MrGraeme

Reputation: 97

Finding strings shared between two dataframe columns

I have a data frame with two columns of delimited strings:

df <- data.frame('a'=c('a, b, c, d', 'a, c', 'b, d'), 'b'=c('a, d', 'a', 'a, d'))

           a      b
1 a, b, c, d   a, d
2       a, c      a
3       b, d   a, d

And I would like to create a third column to include the strings that intersect the first two columns, for example:

           a      b      c
1 a, b, c, d   a, d   a, d
2       a, c      a      a
3       b, d   a, d      d

I have tried a number of approaches that involve converting the strings to lists and back but I don't seem to be able to get it right.

Using dplyr I first attempted to use:

df <- df %>%
    mutate(c=paste(c(intersect(unlist(strsplit(a, split=", ")), unlist(strsplit(b, split=", "))))))

Which resulted in an error:

Error in eval(substitute(expr), envir, enclos) : wrong result size (2), expected 3 or 1

As well as not returning the required string, this also seems to return results of the same size for each row (verified by changing the mutate function above from paste to length below):

df %>%
    mutate(c=length(c(intersect(unlist(strsplit(a, split=", ")), unlist(strsplit(b, split=", "))))))

           a    b   c
1 a, b, c, d a, d   2
2       a, c    a   2
3       b, d a, d   2

Which makes me worry that all my row results are being combined into one result and repeated.

To try to simplify things I attempted to convert my strings into lists before using the intersect function:

df %>% mutate(a_list=list(unlist(strsplit(a, split=", "))))

But received the error:

Error in eval(substitute(expr), envir, enclos) : not compatible with STRSXP

Which makes me wonder whether lists in data frames are compatible with the tidyverse and, if not, whether I need to take an entirely different approach.

Any advice on how to approach the problem of finding strings shared between two data frame columns in R (as well as any insight into how to treat list-like values in data frames) would be gratefully received.

Upvotes: 0

Views: 778

Answers (2)

lmo

Reputation: 38510

This base R method will work: use strsplit to split each column into a list in which every element is a character vector holding one row's pieces. The mapply function then walks the two lists in parallel and applies the given function to the pair of elements at the same position, i.e. to the two vectors that belong to the same row. Inside that function, intersect finds the overlapping elements and paste with collapse joins them back into a single string.

df$c <- mapply(function(x, y) paste(intersect(x, y), collapse=", "),
               strsplit(df$a, ", "), strsplit(df$b, ", "))

df
           a    b    c
1 a, b, c, d a, d a, d
2       a, c    a    a
3       b, d a, d    d

data

df <- data.frame('a'=c('a, b, c, d', 'a, c', 'b, d'),
                 'b'=c('a, d', 'a', 'a, d'), stringsAsFactors=FALSE)
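
To address the side question about list-like values, here is a minimal sketch (not part of the original answer) that contrasts the pooled result from the question with a row-wise version that keeps the split pieces as list columns. The column names a_list, b_list and shared are made up for illustration, and it assumes the columns are character rather than factor.

```r
# Same data as above, with strings kept as character vectors
df <- data.frame(a = c("a, b, c, d", "a, c", "b, d"),
                 b = c("a, d", "a", "a, d"),
                 stringsAsFactors = FALSE)

# Pooled version: unlist() flattens every row into one vector,
# so intersect() compares the whole columns, not row against row
pooled <- intersect(unlist(strsplit(df$a, ", ")),
                    unlist(strsplit(df$b, ", ")))

# Row-wise version: strsplit() already returns one vector per row,
# and a list of length nrow(df) can be stored as a column
df$a_list <- strsplit(df$a, ", ")   # hypothetical column name
df$b_list <- strsplit(df$b, ", ")   # hypothetical column name

# Map() pairs up the elements row by row, like mapply() above
df$shared <- Map(intersect, df$a_list, df$b_list)

# Collapse each row's shared values back into a single string
df$c <- vapply(df$shared, paste, character(1), collapse = ", ")
```

The pooled vector has length 2 for this data, which would explain why every row showed 2 in the question's length() check. Keeping the list columns is only worthwhile if you need the split values again later; otherwise the one-step mapply call is enough.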

Upvotes: 1

Roman

Reputation: 17648

You can try:

library(stringr)
# go through each row, split both columns on commas, trim whitespace,
# then keep the values that occur more than once and paste them together
apply(df, 1, function(x){
  tmp <- str_trim(unlist(str_split(x, ",")))
  paste(tmp[duplicated(tmp)], collapse=", ")
})
[1] "a, d" "a"   "d" 
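
One caveat, sketched below with a made-up row that is not in the question's data: duplicated() only detects repeats in the pooled values, so it matches intersect() only when no value repeats within a single column. The sketch uses base R's strsplit and trimws in place of the stringr calls so that it runs without extra packages.

```r
# Hypothetical row: "a" repeats within the first column
# and does not occur in the second column at all
x <- c(a = "a, a, b", b = "c")

# duplicated() approach: pool both columns, keep repeated values
tmp <- trimws(unlist(strsplit(x, ",")))
dup_result <- paste(tmp[duplicated(tmp)], collapse = ", ")

# intersect() approach: compare the two columns against each other
pieces <- lapply(strsplit(x, ","), trimws)
int_result <- paste(intersect(pieces[[1]], pieces[[2]]), collapse = ", ")

dup_result  # "a": reported even though it is not shared
int_result  # "":  nothing is shared between the columns
```

If the values within each column are known to be unique, as in the question's example, both approaches give the same answer.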

Upvotes: 0
