Reputation: 218
I have a dataset of genes and each gene's interacting genes, stored in 2 columns like this:
Gene Interacting Genes
ACE BRCA2, NOS2, SEPT9
HER2 AGT, TGRF
YUO SEPT9, NOS2
Separately, I have a dataset which is just a list of genes. I am looking to create a count
column of how many interacting genes per gene are also in my second dataset. My second dataset looks like:
Gene
NOS2
SEPT9
QRTY
Output from this example would look like:
Gene Interacting Genes Count
ACE BRCA2, NOS2, SEPT9 2
HER2 AGT, TGRF 0
YUO SEPT9, NOS2 2
#NOS2 and SEPT9 are in the gene list dataframe and so are counted
I've seen similar questions, but none that count matches within a string for each row; this is the part I am stuck on.
Input data:
#df1:
structure(list(Gene = c("ACE", "HER2", "YUO"), interactors = c("BRCA2, NOS2, SEPT9",
"AGT, TGRF",
"SEPT9, NOS2"
)), row.names = c(NA, -3L), class = c("data.table", "data.frame"
))
#df2:
structure(list(Gene = c("NOS2", "SEPT9", "QRTY")), row.names = c(NA,
-3L), class = c("data.table", "data.frame"))
Upvotes: 0
Views: 194
Reputation: 1972
You can use a solution based on dplyr and stringr.
library(dplyr)
library(stringr)
df1 %>%
mutate(count = str_count(interactors, str_c(df2$Gene, collapse = '|')))
# Gene interactors count
# 1 ACE BRCA2, NOS2, SEPT9 2
# 2 HER2 AGT, TGRF 0
# 3 YUO SEPT9, NOS2 2
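One caveat: the alternation pattern matches substrings, so NOS2 would also be counted inside a longer symbol. A minimal sketch that guards against this with word boundaries, assuming gene symbols contain only letters and digits. The data frame `df1_demo` and the symbol `NOS2A` are made up here purely to show the difference:

```r
library(dplyr)
library(stringr)

# Hypothetical variant of df1: the last row contains "NOS2A", an invented symbol
df1_demo <- data.frame(
  Gene = c("ACE", "HER2", "YUO"),
  interactors = c("BRCA2, NOS2, SEPT9", "AGT, TGRF", "SEPT9, NOS2A")
)
df2 <- data.frame(Gene = c("NOS2", "SEPT9", "QRTY"))

# Wrap the alternation in \\b so that only whole symbols are matched
pattern <- str_c("\\b(", str_c(df2$Gene, collapse = "|"), ")\\b")

result <- df1_demo %>%
  mutate(count = str_count(interactors, pattern))
```

Without the `\\b` anchors the third row would count `NOS2A` as a hit for `NOS2`; with them only `SEPT9` is counted in that row.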
Upvotes: 2
Reputation: 11584
Using str_extract_all:
> library(dplyr)
> library(stringr)
> df1 %>% mutate(counter = str_extract_all(interactors, paste0(df2$Gene, collapse = '|'))) %>%
+ rowwise() %>% mutate(count = length(counter)) %>% select(-counter)
# A tibble: 3 x 3
# Rowwise:
Gene interactors count
<chr> <chr> <int>
1 ACE BRCA2, NOS2, SEPT9 2
2 HER2 AGT, TGRF 0
3 YUO SEPT9, NOS2 2
>
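If you would rather avoid regular expressions and `rowwise()`, the same count can probably be obtained in base R by splitting each string and testing exact membership. This is only a sketch, assuming the interactors are always separated by commas:

```r
df1 <- data.frame(
  Gene = c("ACE", "HER2", "YUO"),
  interactors = c("BRCA2, NOS2, SEPT9", "AGT, TGRF", "SEPT9, NOS2")
)
df2 <- data.frame(Gene = c("NOS2", "SEPT9", "QRTY"))

# Split each row's string on commas into a list of character vectors
tokens <- strsplit(df1$interactors, ",", fixed = TRUE)

# For each row, trim whitespace and count the symbols that appear in df2$Gene
df1$count <- vapply(
  tokens,
  function(x) sum(trimws(x) %in% df2$Gene),
  integer(1)
)
```

Because `%in%` compares whole strings, partial matches inside longer symbols are not counted.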
Upvotes: 0