Reputation: 65
I have two dataframes, one of them has a column of 100 genes and the other dataframe has a column that consist 700 rows and each row has several genes that are separated by comma, now I do not know how is it possible to select genes in each row of dataframe 2 according to the gene column in dataframe 1. In another word I want genes in each row of dataframe 2 that are in gene column of dataframe 1.
dataframe1:
column gene:
a
b
c
d
e
f
dataframe2:
column gene:
row1"a,b,c,d,r,t,y"
row2"c,g,h,k,l,a,b,c,p"
I only want comma separated genes in each row of dataframe2 that are in column gene of dataframe1 and other genes in dataframe 2 and are not in dataframe1 be removed.
Upvotes: 1
Views: 78
Reputation: 389095
A tidyverse
option using data from @thelatemail.
library(dplyr)
library(tidyr)
dataframe2 %>%
mutate(row = row_number()) %>%
separate_rows(gene, sep = ',') %>%
left_join(dataframe1 %>%
mutate(gene_sub = gene), by = 'gene') %>%
group_by(row) %>%
summarise(across(c(gene, gene_sub), ~toString(na.omit(.)))) %>%
select(-row)
# gene gene_sub
# <chr> <chr>
#1 a, b, c, d, r, t, y a, b, c, d
#2 c, g, h, k, l, a, b, c, p c, a, b, c
Upvotes: 1
Reputation: 7106
Using tidyverse:
library(tidyverse)
library(rebus)
#>
#> Attaching package: 'rebus'
#> The following object is masked from 'package:stringr':
#>
#> regex
#> The following object is masked from 'package:ggplot2':
#>
#> alpha
dataframe1 <- tibble(gene = c("a", "b", "c", "d", "e", "f"))
dataframe2 <- tibble(gene = c("a,b,c,d,r,t,y","c,g,h,k,l,a,b,c,p"))
result <- str_extract_all(dataframe2$gene, rebus::or1(dataframe1$gene)) %>%
map(~ reduce(.x, str_c, sep = ','))
mutate(dataframe2, gene = result) %>% unnest(c(gene))
#> # A tibble: 2 x 1
#> gene
#> <chr>
#> 1 a,b,c,d
#> 2 c,a,b,c
Created on 2021-06-28 by the reprex package (v2.0.0)
Upvotes: 1
Reputation: 93908
Loop over each row of dataframe2$gene
with sapply
and keep only those values which are %in%
dataframe$gene1
, after strsplit
ing to get each comma-separated value.
dataframe1 <- data.frame(gene = c("a", "b", "c", "d", "e", "f"),
stringsAsFactors=FALSE)
dataframe2 <- data.frame(gene = c("a,b,c,d,r,t,y", "c,g,h,k,l,a,b,c,p"),
stringsAsFactors=FALSE)
dataframe2$gene_sub <- sapply(
strsplit(dataframe2$gene, ","),
function(x) paste(x[x %in% dataframe1$gene], collapse=",")
)
dataframe2
## gene gene_sub
##1 a,b,c,d,r,t,y a,b,c,d
##2 c,g,h,k,l,a,b,c,p c,a,b,c
Upvotes: 1