Concatenate duplicate dataframe values in R

Question

I have a very long dataframe where 1 column out of nearly 56 has many different values, while the rest of the data change in accordance with the first column ID. Here's an example

ID  chrom   left    right   ref_seq var_type    zygosity    transcript_name
0   chr1    1590327 1590328 a       SNP         Hom         NM_033486
0   chr1    1590327 1590328 a       SNP         Hom         NM_033487
0   chr1    1590327 1590328 a       SNP         Hom         NM_033488
0   chr1    1590327 1590328 a       SNP         Hom         NM_033489
0   chr1    1590327 1590328 a       SNP         Hom         NM_033492
0   chr1    1590327 1590328 a       SNP         Hom         NM_033493
1   chr1    1590526 1590527 g       SNP         Hom         NM_033486
1   chr1    1590526 1590527 g       SNP         Hom         NM_033487
1   chr1    1590526 1590527 g       SNP         Hom         NM_033488
1   chr1    1590526 1590527 g       SNP         Hom         NM_033489
1   chr1    1590526 1590527 g       SNP         Hom         NM_033492

The desired result would be to concatenate any duplicate values into a comma seperated string but maintain the ID only once, like this

ID  chrom   left    right   ref_seq var_type    zygosity    transcript_name
0   chr1    1590327 1590328 a       SNP         Hom         NM_033486NM_033487,NM_033488,NM_033489,NM_033492,NM_033493
1   chr1    1590526 1590527 g       SNP         Hom         NM_033486,NM_033487,NM_033488,NM_033489,NM_033492

I've searched for similar questions and the following solutions haven't worked so far; instead they return me a zero row dataframe.

LyzandeR · Accepted Answer

One way with data.table:

library(data.table)
#setDT will convert the data.frame into data.table
#.SD gives you access to the groups of data.tables created by the 'by' argument
setDT(df)[, list(transcript_name = paste(transcript_name, collapse = ', ')), 
            by = c('ID', 'chrom', 'left', 'right', 'ref_seq', 'var_type', 'zygosity')]
#   ID chrom    left   right ref_seq var_type zygosity                                                  transcript_name
#1:  0  chr1 1590327 1590328       a      SNP      Hom NM_033486, NM_033487, NM_033488, NM_033489, NM_033492, NM_033493
#2:  1  chr1 1590526 1590527       g      SNP      Hom            NM_033486, NM_033487, NM_033488, NM_033489, NM_033492

Data

df <- read.table(header = TRUE, text = 'ID  chrom   left    right   ref_seq var_type    zygosity    transcript_name
0   chr1    1590327 1590328 a   SNP Hom NM_033486
                 0   chr1    1590327 1590328 a   SNP Hom NM_033487
                 0   chr1    1590327 1590328 a   SNP Hom NM_033488
                 0   chr1    1590327 1590328 a   SNP Hom NM_033489
                 0   chr1    1590327 1590328 a   SNP Hom NM_033492
                 0   chr1    1590327 1590328 a   SNP Hom NM_033493
                 1   chr1    1590526 1590527 g   SNP Hom NM_033486
                 1   chr1    1590526 1590527 g   SNP Hom NM_033487
                 1   chr1    1590526 1590527 g   SNP Hom NM_033488
                 1   chr1    1590526 1590527 g   SNP Hom NM_033489
                 1   chr1    1590526 1590527 g   SNP Hom NM_033492')

Concatenate duplicate dataframe values in R

Answers (2)

Related Questions