Reputation: 218
I've been trying to use answers to other similar questions but had no luck. I have 2 datasets:
#df1:
Gene
ACE
BRCA
HER2
#df2:
Gene interactors
GP5 ACE, NOS, C456
TP53 NOS, BRCA, NOTCH4
I am looking to add a column to my first dataset to identify genes which appear as interactors in my second dataset.
Output:
#df1:
Gene Matches
ACE TRUE
BRCA TRUE
HER2 FALSE
Currently I'm trying df1$Matches <- mapply(grepl, df1$Gene, df2$interactors)
This runs, but when I increase the number of genes in df1 the number of matches drops. That doesn't make sense, since I don't remove any of the genes that were run originally, which makes me think this isn't working the way I expect.
I've also tried:
library(stringr)
df1 %>%
  rowwise() %>%
  mutate(exists_in_title = str_detect(Gene, df2$interactors))
Error: Column `exists_in_title` must be length 1 (the group size), not 3654
In addition: There were 50 or more warnings (use warnings() to see the first 50)
I've also tried a dplyr version of this with the same error.
What other ways can I approach this? Any help would be appreciated.
Input data:
dput(df1)
structure(list(Gene = c("ACE", "BRCA", "HER2")), row.names = c(NA,
-3L), class = c("data.table", "data.frame"))
dput(df2)
structure(list(Gene = c("GP5", "TP53"), interactors = c("ACE, NOS, C456",
"NOS, BRCA, NOTCH4")), row.names = c(NA, -2L), class = c("data.table",
"data.frame"))
Upvotes: 2
Views: 49
Reputation: 332
With base R
genes <- df1$Gene
res <- genes %in% trimws(unlist(strsplit(df2$interactors, ",")))
Result
> res
[1] TRUE TRUE FALSE
Which can be added onto df1 with
df1$Matches <- res
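The same idea can be wrapped in a small function. This is a minimal sketch, not part of the original answer: the name has_interactor is hypothetical, and the data frames are rebuilt from the question's sample data. It also shows why exact matching with %in% can be safer than grepl, which matches substrings.

```r
# Sample data copied from the question.
df1 <- data.frame(Gene = c("ACE", "BRCA", "HER2"), stringsAsFactors = FALSE)
df2 <- data.frame(
  Gene = c("GP5", "TP53"),
  interactors = c("ACE, NOS, C456", "NOS, BRCA, NOTCH4"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (name is an assumption): TRUE where a gene appears
# as a whole entry in any of the comma-separated interactor strings.
has_interactor <- function(genes, interactor_strings) {
  all_interactors <- trimws(unlist(strsplit(interactor_strings, ",", fixed = TRUE)))
  genes %in% unique(all_interactors)
}

df1$Matches <- has_interactor(df1$Gene, df2$interactors)

# Exact matching does not treat "ACE" as present when only "ACE2" is listed,
# whereas a substring search with grepl does.
exact <- has_interactor("ACE", "ACE2, NOS")
substring_hit <- grepl("ACE", "ACE2, NOS", fixed = TRUE)
```

Whether substring hits are wanted depends on the gene naming in your data; for gene symbols, whole-entry matching is usually the safer default.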
Upvotes: 0
Reputation: 2949
You can split the interactors with strsplit and test membership with %in% (base R only, no packages needed):
df1$Matches <- df1$Gene %in% trimws(unlist(strsplit(df2$interactors, ",")))
> df1
Gene Matches
1 ACE TRUE
2 BRCA TRUE
3 HER2 FALSE
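As for why the mapply(grepl, ...) attempt in the question gives inconsistent counts: mapply pairs the i-th gene with the i-th interactor string only (recycling the shorter vector), so the result depends on row order rather than on whether the gene appears anywhere in df2. A small sketch of that behaviour, using the question's interactor strings and made-up gene orderings:

```r
interactors <- c("ACE, NOS, C456", "NOS, BRCA, NOTCH4")

# mapply checks gene 1 against string 1 and gene 2 against string 2, nothing else.
aligned  <- unname(mapply(grepl, c("ACE", "BRCA"), interactors))
shuffled <- unname(mapply(grepl, c("BRCA", "ACE"), interactors))

# Checking each gene against every row instead (still substring matching).
any_row <- vapply(
  c("BRCA", "ACE"),
  function(g) any(grepl(g, interactors, fixed = TRUE)),
  logical(1),
  USE.NAMES = FALSE
)
```

Here aligned is all TRUE only because the genes happen to line up with the rows; shuffled is all FALSE for the same two genes, while any_row finds both. The %in% approach above avoids the problem entirely.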
Upvotes: 1
Reputation: 10845
Here is an answer combining tidyr and base R. First, we read the data:
text1 <- "Gene
ACE
BRCA
HER2"
text2 <- "Gene|interactors
GP5|ACE, NOS, C456
TP53|NOS, BRCA, NOTCH4"
df1 <- read.csv(text = text1,header = TRUE,stringsAsFactors = FALSE)
df2 <- read.csv(text = text2,header = TRUE,stringsAsFactors = FALSE,sep = "|")
Next, we separate the interactors in df2, and use the resulting vector to create the logical variable in df1.
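library(tidyr)
# one row per interactor; the default sep splits on runs of non-alphanumeric characters such as ", "
df2 <- separate_rows(df2, interactors)
# %in% already returns a logical vector, so no ifelse() is needed
df1$matches <- df1$Gene %in% df2$interactors
df1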
...and the output:
> df1
Gene matches
1 ACE TRUE
2 BRCA TRUE
3 HER2 FALSE
>
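If tidyr is not available, the same long format can be built in base R. The following is a sketch under that assumption, using the question's sample data; the column names interactor and listed_by are made up for illustration. Keeping the long table also lets you see which df2 gene lists each match.

```r
# Sample data copied from the question.
df1 <- data.frame(Gene = c("ACE", "BRCA", "HER2"), stringsAsFactors = FALSE)
df2 <- data.frame(
  Gene = c("GP5", "TP53"),
  interactors = c("ACE, NOS, C456", "NOS, BRCA, NOTCH4"),
  stringsAsFactors = FALSE
)

# Split each interactor string, then repeat the df2 gene once per interactor.
parts <- strsplit(df2$interactors, ",", fixed = TRUE)
long <- data.frame(
  Gene = rep(df2$Gene, lengths(parts)),
  interactor = trimws(unlist(parts)),
  stringsAsFactors = FALSE
)

df1$matches <- df1$Gene %in% long$interactor

# First df2 gene that lists each df1 gene as an interactor (NA if none).
df1$listed_by <- long$Gene[match(df1$Gene, long$interactor)]
```

Note that match() returns only the first hit, so a gene listed by several df2 rows would show just one of them here.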
Upvotes: 0