user2900006
user2900006

Reputation: 447

Reshaping data with missing values

I have a dataset that looks similar to this...

Id  Disease     Gene    Mutation    Expression
101 Disease_X   Gene_A  R273G       Normal
101 Disease_X   GENE_B  G12D        Normal
102 Disease_Y   GENE_C  L858R       High

and I want to reshape it such that each id and gene pair is represented for both 'Mutation and Expression, even if there is no value present.

For instance, each id-gene pair would have 6 possible values (3 genes for Mutation, 3 for Expression) and if there is no value in the original table for Mutation or Expression, the output would provide some standard output for missing data to fill that row (eg. "No Mutation data"). A table output would look like this:

Id  Disease     Type        Gene    Value
101 Disease_X   Mutation    Gene A  R273G
101 Disease_X   Mutation    GENE B  G12D
101 Disease_X   Mutation    GENE C  No Mutation Data
101 Disease_X   Expression  Gene A  Normal
101 Disease_X   Expression  GENE B  Normal
101 Disease_X   Expression  GENE C  No Expression Data 
102 Disease_Y   Mutation    Gene A  No Mutation Data
102 Disease_Y   Mutation    GENE B  No Mutation Data
102 Disease_Y   Mutation    GENE C  L858R
102 Disease_Y   Expression  Gene A  No Expression Value
102 Disease_Y   Expression  GENE B  No Expression Value
102 Disease_Y   Expression  GENE C  High

I know there is an easy way to do this (using merge or melt?) but I haven't come up with anything straightforward.

Upvotes: 2

Views: 1395

Answers (1)

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193667

You need to do a few extra steps to get exactly what you seem to be looking for.

In the following, I start by making all of the combinations of "Id", "Type", and "Gene", merging that with a "long" form of your dataset, and then fixing the "Disease" column.

I've left the NA as NA since that seems to make more sense to me in case you need to do any further work.

This assumes you're starting with a dataset called "mydf".

library(data.table)
library(reshape2)

DT <- as.data.table(mydf)                                ## Convert to data.table
DTL <- melt(DT, id.vars = c("Id", "Disease", "Gene"))    ## Make it long
groups <- c("Id", "Gene", "variable")                    ## Save some typing
toMerge <- do.call(CJ, lapply(DTL[, groups,              ## Generate the combos
                                  with = FALSE], unique))
merged <- merge(DTL, toMerge, by = groups, all = TRUE)   ## merge
merged[, Disease := unique(na.omit(Disease)), by = Id][] ## Fill in Disease
#      Id   Gene   variable   Disease  value
#  1: 101 GENE_B   Mutation Disease_X   G12D
#  2: 101 GENE_B Expression Disease_X Normal
#  3: 101 GENE_C   Mutation Disease_X     NA
#  4: 101 GENE_C Expression Disease_X     NA
#  5: 101 Gene_A   Mutation Disease_X  R273G
#  6: 101 Gene_A Expression Disease_X Normal
#  7: 102 GENE_B   Mutation Disease_Y     NA
#  8: 102 GENE_B Expression Disease_Y     NA
#  9: 102 GENE_C   Mutation Disease_Y  L858R
# 10: 102 GENE_C Expression Disease_Y   High
# 11: 102 Gene_A   Mutation Disease_Y     NA
# 12: 102 Gene_A Expression Disease_Y     NA

Upvotes: 3

Related Questions