Sebastian Zeki
Sebastian Zeki

Reputation: 6874

How to merge a row in one column when there are duplicates in another

I have a dataframe as follows:

structure(list(ZSSLX.10456.FastSeqA.BiopsyTumour_LCM_PS14_1105_1F_100PCent.gz = c(40.9612938016794, 
40.9612938016794, 40.9612938016794, 40.9612938016794, 40.9612938016794, 
40.9612938016794, 40.9612938016794, 40.9612938016794), Stop = c(5864876, 
5864876, 5864876, 5864876, 5864876, 5864876, 5864876, 5864876
), MergeCol = c("1:4864876:5864876", "1:4864876:5864876", "1:4864876:5864876", 
"1:4864876:5864876", "1:4864876:5864876", "1:4864876:5864876", 
"1:4864876:5864876", "1:4864876:5864876"), hgnc_symbol = c("MIR4417", 
"MIR4689", "", "", "NPHP4", "", "", ""), chromosome_name = c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("ZSSLX.10456.FastSeqA.BiopsyTumour_LCM_PS14_1105_1F_100PCent.gz", 
"Stop", "MergeCol", "hgnc_symbol", "chromosome_name"), row.names = c(130L, 
289L, 432L, 446L, 1681L, 3467L, 3468L, 3469L), class = "data.frame")

I would like to merge the values in hgnc_symbol if there is a duplicate in MergCol. In this example I would end up with

structure(list(ZSSLX.10456.FastSeqA.BiopsyTumour_LCM_PS14_1105_1F_100PCent.gz = 40.9612938016794, Stop = 5864876, MergeCol = "1:4864876:5864876", hgnc_symbol = "MIR4417,MIR4689,NPHP4", chromosome_name = 1L), .Names = c("ZSSLX.10456.FastSeqA.BiopsyTumour_LCM_PS14_1105_1F_100PCent.gz", "Stop", "MergeCol", "hgnc_symbol", "chromosome_name"), row.names = 130L, class = "data.frame")

Is there a way of collapsing a row in one column based on another column being duplicated?

Upvotes: 1

Views: 64

Answers (1)

Sam Dickson
Sam Dickson

Reputation: 5239

It seems like your trying to make a dataframe do everything a list can do, but if you're willing to combine a vector into a character string, you might do it like this:

library(plyr)
ddply(df,names(df)[-4],summarize,
      hgnc_symbol=paste(unique(hgnc_symbol[hgnc_symbol!=""]),collapse=","))

#   ZSSLX.10456.FastSeqA.BiopsyTumour_LCM_PS14_1105_1F_100PCent.gz    Stop          MergeCol  chromosome_name             hgnc_symbol
# 1                                                       40.96129 5864876 1:4864876:5864876                1 MIR4417, MIR4689, NPHP4

Note that this assumes that every time MergeCol is duplicated that everything else except hgnc_symbol is duplicated, too. You can apply the same concept to the other columns that are not necessarily duplicated as well.

Upvotes: 1

Related Questions