DN1

Reputation: 218

How to identify matching strings between datasets?

I've tried adapting answers to similar questions but have had no luck. I have two datasets:

#df1:
Gene
ACE
BRCA
HER2
#df2:
Gene       interactors
GP5       ACE, NOS, C456
TP53      NOS, BRCA, NOTCH4

I am looking to add a column to my first dataset to identify genes which appear as interactors in my second dataset.

Output:

#df1:
Gene   Matches
ACE      TRUE
BRCA     TRUE
HER2     FALSE

Currently I'm trying df1$Matches <- mapply(grepl, df1$Gene, df2$interactors). This runs, but when I increase the number of genes in df1 the number of matches drops. That doesn't make sense, because I don't remove any of the genes that were run originally, so I suspect it isn't working the way I expect.

I've also tried:

library(stringr)
df1 %>%
    rowwise() %>%
    mutate(exists_in_title = str_detect(Gene, df2$interactors))
Error: Column `exists_in_title` must be length 1 (the group size), not 3654
In addition: There were 50 or more warnings (use warnings() to see the first 50)

I've also tried a dplyr version of this with the same error.

What other ways can I approach this? Any help would be appreciated.

Input data:

dput(df1)
structure(list(Gene = c("ACE", "BRCA", "HER2")), row.names = c(NA, 
-3L), class = c("data.table", "data.frame"))

dput(df2)
structure(list(Gene = c("GP5", "TP53"), interactors = c("ACE, NOS, C456", 
"NOS, BRCA, NOTCH4")), row.names = c(NA, -2L), class = c("data.table", 
"data.frame"))

Upvotes: 2

Views: 49

Answers (3)

m.k.

Reputation: 332

With base R

genes <- df1$Gene
res <- genes %in% trimws(unlist(strsplit(df2$interactors, ",")))

Result

> res
[1]  TRUE  TRUE FALSE

Which can be added onto df1 with

df1$Matches <- res
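
As a possible explanation for the behaviour described in the question, here is a minimal sketch of how mapply() pairs its arguments. It assumes the interactors strings from the question and a hypothetical gene order chosen for illustration; the names genes, paired, pool and pooled are not from the original post.

```r
# Interactor strings from the question; the gene order is chosen for illustration
interactors <- c("ACE, NOS, C456", "NOS, BRCA, NOTCH4")
genes <- c("BRCA", "ACE", "HER2", "NOS")

# mapply() walks both vectors in parallel and recycles the shorter one,
# so gene i is only tested against a single interactors string
paired <- unname(mapply(grepl, genes, interactors))

# Splitting every string into individual symbols and using %in%
# tests each gene against the whole pool instead
pool <- trimws(unlist(strsplit(interactors, ",")))
pooled <- genes %in% pool
```

Here paired misses BRCA and ACE because each happens to be compared with the other row, while pooled finds both. Results that depend on row order and vector length would be consistent with the match count changing as df1 grows.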

Upvotes: 0

Mohanasundaram

Reputation: 2949

You can split the interactors with strsplit:

# Base R only; no packages need to be loaded
df1$Matches <-  df1$Gene %in% trimws(unlist(strsplit(df2$interactors, ",")))

> df1
  Gene Matches
1  ACE    TRUE
2 BRCA    TRUE
3 HER2   FALSE
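
One reason to prefer splitting plus %in% over a grepl/str_detect search is that pattern matching finds substrings. The sketch below assumes a hypothetical interactor "ACE2" in place of "ACE", purely for illustration; the variable names are also made up.

```r
# Hypothetical data: "ACE2" replaces "ACE" to show the substring problem
interactors <- c("ACE2, NOS, C456", "NOS, BRCA, NOTCH4")
genes <- c("ACE", "BRCA", "HER2")

# Substring search: "ACE" is found inside "ACE2"
substring_hit <- vapply(
  genes,
  function(g) any(grepl(g, interactors, fixed = TRUE)),
  logical(1),
  USE.NAMES = FALSE
)

# Exact comparison of whole symbols after splitting on commas
symbols <- trimws(unlist(strsplit(interactors, ",", fixed = TRUE)))
exact_hit <- genes %in% symbols
```

With this data substring_hit reports ACE as a match while exact_hit does not, so the split-based approach is probably the safer choice when gene symbols share prefixes.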

Upvotes: 1

Len Greski

Reputation: 10845

Here is an answer combining tidyr and base R. First, we read the data:

text1 <- "Gene
ACE
BRCA
HER2"
text2 <- "Gene|interactors
GP5|ACE, NOS, C456
TP53|NOS, BRCA, NOTCH4"

df1 <- read.csv(text = text1,header = TRUE,stringsAsFactors = FALSE)
df2 <- read.csv(text = text2,header = TRUE,stringsAsFactors = FALSE,sep = "|")

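Next, we separate the interactors in df2 into one row each, and use the resulting vector to create the logical variable in df1.

library(tidyr)
# separate_rows() splits on non-alphanumeric characters by default, so ", " is consumed
df2 <- separate_rows(df2, interactors)
# %in% already returns a logical vector, so ifelse() is not needed
df1$matches <- df1$Gene %in% df2$interactors
df1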

...and the output:

> df1
  Gene matches
1  ACE    TRUE
2 BRCA    TRUE
3 HER2   FALSE
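
For readers without tidyr, the long format that the separation step produces can be approximated in base R. This is only a sketch, assuming the question's data; parts and long are illustrative names, and the splitting regex is an assumption that fits comma-plus-space separators.

```r
# Data from the question
df1 <- data.frame(Gene = c("ACE", "BRCA", "HER2"), stringsAsFactors = FALSE)
df2 <- data.frame(
  Gene = c("GP5", "TP53"),
  interactors = c("ACE, NOS, C456", "NOS, BRCA, NOTCH4"),
  stringsAsFactors = FALSE
)

# Split each interactors string on a comma followed by optional whitespace
parts <- strsplit(df2$interactors, ",\\s*")

# One row per (Gene, interactor) pair, repeating Gene once per interactor
long <- data.frame(
  Gene = rep(df2$Gene, lengths(parts)),
  interactors = unlist(parts),
  stringsAsFactors = FALSE
)

df1$matches <- df1$Gene %in% long$interactors
```

Keeping the Gene column alongside each interactor also makes it possible to look up which df2 row a match came from, if that is ever needed.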

Upvotes: 0
