Reputation: 6874
I have a dataframe as follows:
structure(list(ZSSLX.10456.FastSeqA.BiopsyTumour_LCM_PS14_1105_1F_100PCent.gz = c(40.9612938016794,
40.9612938016794, 40.9612938016794, 40.9612938016794, 40.9612938016794,
40.9612938016794, 40.9612938016794, 40.9612938016794), Stop = c(5864876,
5864876, 5864876, 5864876, 5864876, 5864876, 5864876, 5864876
), MergeCol = c("1:4864876:5864876", "1:4864876:5864876", "1:4864876:5864876",
"1:4864876:5864876", "1:4864876:5864876", "1:4864876:5864876",
"1:4864876:5864876", "1:4864876:5864876"), hgnc_symbol = c("MIR4417",
"MIR4689", "", "", "NPHP4", "", "", ""), chromosome_name = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("ZSSLX.10456.FastSeqA.BiopsyTumour_LCM_PS14_1105_1F_100PCent.gz",
"Stop", "MergeCol", "hgnc_symbol", "chromosome_name"), row.names = c(130L,
289L, 432L, 446L, 1681L, 3467L, 3468L, 3469L), class = "data.frame")
I would like to merge the values in hgnc_symbol if there is a duplicate in MergCol. In this example I would end up with
structure(list(ZSSLX.10456.FastSeqA.BiopsyTumour_LCM_PS14_1105_1F_100PCent.gz = 40.9612938016794, Stop = 5864876, MergeCol = "1:4864876:5864876", hgnc_symbol = "MIR4417,MIR4689,NPHP4", chromosome_name = 1L), .Names = c("ZSSLX.10456.FastSeqA.BiopsyTumour_LCM_PS14_1105_1F_100PCent.gz", "Stop", "MergeCol", "hgnc_symbol", "chromosome_name"), row.names = 130L, class = "data.frame")
Is there a way of collapsing a row in one column based on another column being duplicated?
Upvotes: 1
Views: 64
Reputation: 5239
It seems like your trying to make a dataframe do everything a list can do, but if you're willing to combine a vector into a character string, you might do it like this:
library(plyr)
ddply(df,names(df)[-4],summarize,
hgnc_symbol=paste(unique(hgnc_symbol[hgnc_symbol!=""]),collapse=","))
# ZSSLX.10456.FastSeqA.BiopsyTumour_LCM_PS14_1105_1F_100PCent.gz Stop MergeCol chromosome_name hgnc_symbol
# 1 40.96129 5864876 1:4864876:5864876 1 MIR4417, MIR4689, NPHP4
Note that this assumes that every time MergeCol
is duplicated that everything else except hgnc_symbol
is duplicated, too. You can apply the same concept to the other columns that are not necessarily duplicated as well.
Upvotes: 1