Reputation: 978
This is the data-frame
A
gene
ENSG00000000938
ENSG00000001617
ENSG00000001630
ENSG00000002330
ENSG00000002726
ENSG00000002822
The row which would be common across the other data frame is the third column
Expression dataframe
gene log2FoldChange
ENSG00000000938 4.247333
ENSG00000001617 3.381206
ENSG00000001630 2.148626
ENSG00000002330 2.876010
ENSG00000002822 2.508205
> Region dataframe
gene log2FoldChange
ENSG00000000938 8.247333
ENSG00000001617 1.381206
ENSG00000001630 6.148626
ENSG00000002330 3.876010
ENSG00000002726 1.602266
So i want to map the A with both the column expression dataframe and Region dataframe
Such that I would like to to end up like this
My final data-frame should be looking something like this
Output expected
gene log2FoldChange_Expression log2FoldChange_Region
ENSG00000000938 4.247333 8.247333
ENSG00000001617 3.381206 1.381206
ENSG00000001630 2.148626 6.148626
ENSG00000002330 2.876010 3.876010
The approach i tried is
Step 1 Inner join the dataframe A with Expression dataframe Step 2 Inner Join the datafameA with region dataframe
The trying to paste each dataframe in a single dataframe. Im not sure if this is the right approach
Upvotes: 1
Views: 307
Reputation: 11
Personally, I would do this by matching the order of the two datasets and using cbind to bind the columns from each dataset together, now they're sorted. Assuming the two datasets have the same set of genes:
Exp_data<-Exp_data[match(Reg_data$gene,Exp_data$gene),]
#reorder the rows of Exp_data so the gene names match the order found in Reg_data
identical(Exp_data$gene,Reg_data$gene)
#If the genes are now in exactly the same order in Exp_data and Reg_data, this should return 'TRUE'
combined_data<-cbind(Exp_data,Reg_data)
#bind the datasets together in column by column manner and assign it to a new object called combined_data
I find this a very straightforward way of thinking about this but I always check the ordering of lists, as 'match' is easily misused.
Upvotes: 1
Reputation: 27732
Here is a data.table
approach, so make sure your datasets are in data.table format. If not, you can uncomment the setDT(...) line in the code below.
sample data
library(data.table)
library(tibble)
A <- fread("gene
ENSG00000000938
ENSG00000001617
ENSG00000001630
ENSG00000002330
ENSG00000002726
ENSG00000002822")
expression <- fread("gene log2FoldChange
ENSG00000000938 4.247333
ENSG00000001617 3.381206
ENSG00000001630 2.148626
ENSG00000002330 2.876010
ENSG00000002822 2.508205")
region <- fread("gene log2FoldChange
ENSG00000000938 8.247333
ENSG00000001617 1.381206
ENSG00000001630 6.148626
ENSG00000002330 3.876010
ENSG00000002726 1.602266")
code
#put data in a named list,
L <- tibble::lst(A, expression, region)
#loop over the list's L indexes
L <- lapply(seq_along(L), function(x) {
#if data is not yet in data.table format, convert it
# setDT(L[[x]]) # <-!! uncomment if your data is not already data.table
#set the key (gene) to join on
setkey(L[[x]], gene)
#if the data has more than two columns, add a suffix to the columns 2 and higher
#based on the name if the data (derived from the list L's names)
if (x > 1) {
output <- setnames(L[[x]],
grep("^(?!gene)", names(L[[x]]), value = TRUE, perl = TRUE),
paste0(grep("^(?!gene)", names(L[[x]]), value = TRUE, perl = TRUE), "_", names(L)[x]))
} else {
#else return the data as-is
output <- L[[x]]
}
#return output
return(output)
})
#step over list L, and merge them al together
#reduce join
Reduce( function(...) merge(..., all = TRUE), L )
output
# gene log2FoldChange_expression log2FoldChange_region
# 1: ENSG00000000938 4.247333 8.247333
# 2: ENSG00000001617 3.381206 1.381206
# 3: ENSG00000001630 2.148626 6.148626
# 4: ENSG00000002330 2.876010 3.876010
# 5: ENSG00000002726 NA 1.602266
# 6: ENSG00000002822 2.508205 NA
Upvotes: 1