Reputation: 447
I have a dataset that looks similar to this...
Id Disease Gene Mutation Expression 101 Disease_X Gene_A R273G Normal 101 Disease_X GENE_B G12D Normal 102 Disease_Y GENE_C L858R High
and I want to reshape it such that each id and gene pair is represented for both 'Mutation and Expression, even if there is no value present.
For instance, each id-gene pair would have 6 possible values (3 genes for Mutation, 3 for Expression) and if there is no value in the original table for Mutation or Expression, the output would provide some standard output for missing data to fill that row (eg. "No Mutation data"). A table output would look like this:
Id Disease Type Gene Value 101 Disease_X Mutation Gene A R273G 101 Disease_X Mutation GENE B G12D 101 Disease_X Mutation GENE C No Mutation Data 101 Disease_X Expression Gene A Normal 101 Disease_X Expression GENE B Normal 101 Disease_X Expression GENE C No Expression Data 102 Disease_Y Mutation Gene A No Mutation Data 102 Disease_Y Mutation GENE B No Mutation Data 102 Disease_Y Mutation GENE C L858R 102 Disease_Y Expression Gene A No Expression Value 102 Disease_Y Expression GENE B No Expression Value 102 Disease_Y Expression GENE C High
I know there is an easy way to do this (using merge or melt?) but I haven't come up with anything straightforward.
Upvotes: 2
Views: 1395
Reputation: 193667
You need to do a few extra steps to get exactly what you seem to be looking for.
In the following, I start by making all of the combinations of "Id", "Type", and "Gene", merging that with a "long" form of your dataset, and then fixing the "Disease" column.
I've left the NA
as NA
since that seems to make more sense to me in case you need to do any further work.
This assumes you're starting with a dataset called "mydf".
library(data.table)
library(reshape2)
DT <- as.data.table(mydf) ## Convert to data.table
DTL <- melt(DT, id.vars = c("Id", "Disease", "Gene")) ## Make it long
groups <- c("Id", "Gene", "variable") ## Save some typing
toMerge <- do.call(CJ, lapply(DTL[, groups, ## Generate the combos
with = FALSE], unique))
merged <- merge(DTL, toMerge, by = groups, all = TRUE) ## merge
merged[, Disease := unique(na.omit(Disease)), by = Id][] ## Fill in Disease
# Id Gene variable Disease value
# 1: 101 GENE_B Mutation Disease_X G12D
# 2: 101 GENE_B Expression Disease_X Normal
# 3: 101 GENE_C Mutation Disease_X NA
# 4: 101 GENE_C Expression Disease_X NA
# 5: 101 Gene_A Mutation Disease_X R273G
# 6: 101 Gene_A Expression Disease_X Normal
# 7: 102 GENE_B Mutation Disease_Y NA
# 8: 102 GENE_B Expression Disease_Y NA
# 9: 102 GENE_C Mutation Disease_Y L858R
# 10: 102 GENE_C Expression Disease_Y High
# 11: 102 Gene_A Mutation Disease_Y NA
# 12: 102 Gene_A Expression Disease_Y NA
Upvotes: 3