Reputation: 189
I have this dataset and I would like to recast it so that the ID.name values are the rows, the Canonical_Hugo_Symbol values are the column names, and the Canonical_Protein_Change values are the cell values. It would be great if the other cells held 0 rather than NA.
mydata.df <- data.frame(ID.name = c("1000", "1000", "1000", "1001","1001","1001","1002","1002" ), Canonical_Protein_Change = c("p.Y1467H", "p.R1466W", "p.*427Q", "p.V320fs","p.S5383fs","p.D519V","p.S51A", "p.K183_splice" ), Canonical_Hugo_Symbol = c("gene1", "gene3", "gene1", "gene1","gene3","gene4","gene1", "gene2" ))
I have melted it:
ff.melt <- melt(mydata.df, id.var = c("ID.name", "Canonical_Hugo_Symbol"))
ff.melt
ID.name Canonical_Hugo_Symbol variable value
1 1000 gene1 Canonical_Protein_Change p.Y1467H
2 1000 gene3 Canonical_Protein_Change p.R1466W
3 1000 gene1 Canonical_Protein_Change p.*427Q
4 1001 gene1 Canonical_Protein_Change p.V320fs
5 1001 gene3 Canonical_Protein_Change p.S5383fs
6 1001 gene4 Canonical_Protein_Change p.D519V
7 1002 gene1 Canonical_Protein_Change p.S51A
8 1002 gene2 Canonical_Protein_Change p.K183_splice
Then I have recast it:
ff.cast <- dcast(ff.melt, ID.name ~ Canonical_Hugo_Symbol + value)
And I get this data frame:
ff.cast
ID.name gene1_p.*427Q gene1_p.S51A gene1_p.V320fs gene1_p.Y1467H gene2_p.K183_splice gene3_p.R1466W gene3_p.S5383fs
1 1000 p.*427Q <NA> <NA> p.Y1467H <NA> p.R1466W <NA>
2 1001 <NA> <NA> p.V320fs <NA> <NA> <NA> p.S5383fs
3 1002 <NA> p.S51A <NA> <NA> p.K183_splice <NA> <NA>
gene4_p.D519V
1 <NA>
2 p.D519V
3 <NA>
It is close to what I want, but now each gene is spread over many columns with different names. For example, I want gene1_p.*427Q, gene1_p.S51A, gene1_p.V320fs, and gene1_p.Y1467H all in one column.
I also used:
dcast(mydata.df, ID.name ~ Canonical_Hugo_Symbol, value_var = "Canonical_Protein_Change" )
but I get this error message:
Error in .fun(.value[0], ...) : 2 arguments passed to 'length' which requires 1
Thanks
I would like to have this table or something like this! Thanks!
ID.name gene1 gene2 gene3 gene4
1 1000 p.*427Q 0 p.R1466W 0
2 1001 p.V320fs 0 p.S5383fs p.D519V
3 1002 p.S51A p.K183 0 0
When I tried reshape() I got closer, but the column names are wrong:
mydata.reshape <- reshape(mydata.df, direction = 'wide', idvar = 'ID.name', timevar = 'Canonical_Hugo_Symbol')
I have fixed the column names:
colnames(mydata.reshape) <- sub("Canonical_Protein_Change.(.*?)","\\1", colnames(mydata.reshape))
But the NAs are still there.
Upvotes: 1
Views: 2929
Reputation: 67778
You may try this:
library(reshape2)  # provides dcast()
# concatenate values in cells with more than one value
dcast(mydata.df, ID.name ~ Canonical_Hugo_Symbol, value.var = "Canonical_Protein_Change",
      fun.aggregate = function(x) paste(x, collapse = "; "), fill = "0")
# ID.name gene1 gene2 gene3 gene4
# 1 1000 p.Y1467H; p.*427Q 0 p.R1466W 0
# 2 1001 p.V320fs 0 p.S5383fs p.D519V
# 3 1002 p.S51A p.K183_splice 0 0
# ...or pick the first value in cells with more than one value
dcast(mydata.df, ID.name ~ Canonical_Hugo_Symbol, value.var = "Canonical_Protein_Change",
fun.aggregate = head, 1, fill = "0")
# ID.name gene1 gene2 gene3 gene4
# 1 1000 p.Y1467H 0 p.R1466W 0
# 2 1001 p.V320fs 0 p.S5383fs p.D519V
# 3 1002 p.S51A p.K183_splice 0 0
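If you would rather not depend on an extra package, here is a rough base R sketch of the same idea. It assumes the mydata.df from the question (rebuilt here with stringsAsFactors = FALSE so the values stay character); the names wide and wide.df are only illustrative. tapply() pastes together all changes that share an ID and a gene, and combinations that never occur come back as NA, which are then replaced by "0".

```r
# Same data as in the question; keep strings as character, not factor
mydata.df <- data.frame(
  ID.name = c("1000", "1000", "1000", "1001", "1001", "1001", "1002", "1002"),
  Canonical_Protein_Change = c("p.Y1467H", "p.R1466W", "p.*427Q", "p.V320fs",
                               "p.S5383fs", "p.D519V", "p.S51A", "p.K183_splice"),
  Canonical_Hugo_Symbol = c("gene1", "gene3", "gene1", "gene1",
                            "gene3", "gene4", "gene1", "gene2"),
  stringsAsFactors = FALSE
)

# One cell per ID/gene pair; multiple changes are joined with "; "
wide <- tapply(mydata.df$Canonical_Protein_Change,
               list(mydata.df$ID.name, mydata.df$Canonical_Hugo_Symbol),
               paste, collapse = "; ")

# Combinations that never occur come back as NA; replace them with "0"
wide[is.na(wide)] <- "0"

# Turn the matrix into a data frame with ID.name as a regular column
wide.df <- data.frame(ID.name = rownames(wide), wide,
                      row.names = NULL, stringsAsFactors = FALSE)
wide.df
```

This should give the same layout as the first dcast() call above, with one column per gene and "0" in the empty cells.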
Upvotes: 2