PesKchan
PesKchan

Reputation: 978

Map two different data-frame with common row and combining them together

This is the data-frame

A

 gene
 ENSG00000000938
 ENSG00000001617
 ENSG00000001630
 ENSG00000002330
 ENSG00000002726
 ENSG00000002822    

The row which would be common across the other data frame is the third column

Expression dataframe

gene log2FoldChange
     ENSG00000000938       4.247333
     ENSG00000001617       3.381206
     ENSG00000001630       2.148626
     ENSG00000002330       2.876010
     ENSG00000002822       2.508205

> Region dataframe

gene log2FoldChange
     ENSG00000000938       8.247333
     ENSG00000001617       1.381206
     ENSG00000001630       6.148626
     ENSG00000002330       3.876010
     ENSG00000002726       1.602266

So i want to map the A with both the column expression dataframe and Region dataframe Such that I would like to to end up like this

My final data-frame should be looking something like this

Output expected

gene                log2FoldChange_Expression log2FoldChange_Region
 ENSG00000000938          4.247333                8.247333
 ENSG00000001617          3.381206                1.381206
 ENSG00000001630          2.148626                6.148626
 ENSG00000002330          2.876010                3.876010

The approach i tried is

Step 1 Inner join the dataframe A with Expression dataframe Step 2 Inner Join the datafameA with region dataframe

The trying to paste each dataframe in a single dataframe. Im not sure if this is the right approach

Upvotes: 1

Views: 307

Answers (2)

Terence Garner
Terence Garner

Reputation: 11

Personally, I would do this by matching the order of the two datasets and using cbind to bind the columns from each dataset together, now they're sorted. Assuming the two datasets have the same set of genes:

Exp_data<-Exp_data[match(Reg_data$gene,Exp_data$gene),] #reorder the rows of Exp_data so the gene names match the order found in Reg_data

identical(Exp_data$gene,Reg_data$gene) #If the genes are now in exactly the same order in Exp_data and Reg_data, this should return 'TRUE'

combined_data<-cbind(Exp_data,Reg_data) #bind the datasets together in column by column manner and assign it to a new object called combined_data

I find this a very straightforward way of thinking about this but I always check the ordering of lists, as 'match' is easily misused.

Upvotes: 1

Wimpel
Wimpel

Reputation: 27732

Here is a data.table approach, so make sure your datasets are in data.table format. If not, you can uncomment the setDT(...) line in the code below.

sample data

library(data.table)
library(tibble)
A <- fread("gene
 ENSG00000000938
 ENSG00000001617
 ENSG00000001630
 ENSG00000002330
 ENSG00000002726
 ENSG00000002822")
expression <- fread("gene log2FoldChange
     ENSG00000000938       4.247333
     ENSG00000001617       3.381206
     ENSG00000001630       2.148626
     ENSG00000002330       2.876010
     ENSG00000002822       2.508205")
region <- fread("gene log2FoldChange
     ENSG00000000938       8.247333
     ENSG00000001617       1.381206
     ENSG00000001630       6.148626
     ENSG00000002330       3.876010
     ENSG00000002726       1.602266")

code

#put data in a named list, 
L <- tibble::lst(A, expression, region)
#loop over the list's L indexes
L <- lapply(seq_along(L), function(x) {
  #if data is not yet in data.table format, convert it
  # setDT(L[[x]])  # <-!! uncomment if your data is not already data.table
  #set the key (gene) to join on
  setkey(L[[x]], gene)
  #if the data has more than two columns, add a suffix to the columns 2 and higher
  #based on the name if the data (derived from the list L's names)
  if (x > 1) {
    output <- setnames(L[[x]], 
             grep("^(?!gene)", names(L[[x]]), value = TRUE, perl = TRUE),
             paste0(grep("^(?!gene)", names(L[[x]]), value = TRUE, perl = TRUE), "_", names(L)[x]))
  } else {
  #else return the data as-is
    output <- L[[x]]
  }
  #return output
  return(output)
})
#step over list L, and merge them al together
#reduce join
Reduce( function(...) merge(..., all = TRUE), L )

output

#               gene log2FoldChange_expression log2FoldChange_region
# 1: ENSG00000000938                  4.247333              8.247333
# 2: ENSG00000001617                  3.381206              1.381206
# 3: ENSG00000001630                  2.148626              6.148626
# 4: ENSG00000002330                  2.876010              3.876010
# 5: ENSG00000002726                        NA              1.602266
# 6: ENSG00000002822                  2.508205                    NA

Upvotes: 1

Related Questions