Reputation: 65

select some values between 2 dataframe

I have two dataframes, one of them has a column of 100 genes and the other dataframe has a column that consist 700 rows and each row has several genes that are separated by comma, now I do not know how is it possible to select genes in each row of dataframe 2 according to the gene column in dataframe 1. In another word I want genes in each row of dataframe 2 that are in gene column of dataframe 1.

dataframe1:
column gene:
a
b
c
d
e
f

dataframe2:
column gene: 
row1"a,b,c,d,r,t,y"
row2"c,g,h,k,l,a,b,c,p"

I only want comma separated genes in each row of dataframe2 that are in column gene of dataframe1 and other genes in dataframe 2 and are not in dataframe1 be removed.

Upvotes: 1

Answers (3)

Ronak Shah

Reputation: 389095

A tidyverse option using data from @thelatemail.

library(dplyr)
library(tidyr)

dataframe2 %>%
  mutate(row =  row_number()) %>%
  separate_rows(gene, sep = ',') %>%
  left_join(dataframe1 %>%
            mutate(gene_sub = gene), by = 'gene') %>%
  group_by(row) %>%
  summarise(across(c(gene, gene_sub), ~toString(na.omit(.)))) %>%
  select(-row)

#  gene                      gene_sub  
#  <chr>                     <chr>     
#1 a, b, c, d, r, t, y       a, b, c, d
#2 c, g, h, k, l, a, b, c, p c, a, b, c

Upvotes: 1

jpdugo17

Reputation: 7106

Using tidyverse:

library(tidyverse)
library(rebus)
#> 
#> Attaching package: 'rebus'
#> The following object is masked from 'package:stringr':
#> 
#>     regex
#> The following object is masked from 'package:ggplot2':
#> 
#>     alpha

dataframe1 <- tibble(gene = c("a", "b", "c", "d", "e", "f"))
dataframe2 <- tibble(gene = c("a,b,c,d,r,t,y","c,g,h,k,l,a,b,c,p"))

result <- str_extract_all(dataframe2$gene, rebus::or1(dataframe1$gene)) %>%
            map(~ reduce(.x, str_c, sep = ','))

mutate(dataframe2, gene = result) %>% unnest(c(gene))
#> # A tibble: 2 x 1
#>   gene   
#>   <chr>  
#> 1 a,b,c,d
#> 2 c,a,b,c

^{Created on 2021-06-28 by the reprex package (v2.0.0)}

Upvotes: 1

thelatemail

Reputation: 93908

Loop over each row of dataframe2$gene with sapply and keep only those values which are %in% dataframe$gene1, after strspliting to get each comma-separated value.

dataframe1 <- data.frame(gene = c("a", "b", "c", "d", "e", "f"),
                         stringsAsFactors=FALSE)
dataframe2 <- data.frame(gene = c("a,b,c,d,r,t,y", "c,g,h,k,l,a,b,c,p"),
                         stringsAsFactors=FALSE)

dataframe2$gene_sub <- sapply(
    strsplit(dataframe2$gene, ","),
    function(x) paste(x[x %in% dataframe1$gene], collapse=",")
)

dataframe2
##               gene gene_sub
##1     a,b,c,d,r,t,y  a,b,c,d
##2 c,g,h,k,l,a,b,c,p  c,a,b,c

Upvotes: 1

select some values between 2 dataframe

Answers (3)

Related Questions