Reputation: 423
I have a very long data frame in which one column out of nearly 56 takes many different values, while the remaining columns stay constant for a given value of the first column, ID. Here's an example:
ID chrom left right ref_seq var_type zygosity transcript_name
0 chr1 1590327 1590328 a SNP Hom NM_033486
0 chr1 1590327 1590328 a SNP Hom NM_033487
0 chr1 1590327 1590328 a SNP Hom NM_033488
0 chr1 1590327 1590328 a SNP Hom NM_033489
0 chr1 1590327 1590328 a SNP Hom NM_033492
0 chr1 1590327 1590328 a SNP Hom NM_033493
1 chr1 1590526 1590527 g SNP Hom NM_033486
1 chr1 1590526 1590527 g SNP Hom NM_033487
1 chr1 1590526 1590527 g SNP Hom NM_033488
1 chr1 1590526 1590527 g SNP Hom NM_033489
1 chr1 1590526 1590527 g SNP Hom NM_033492
The desired result is to concatenate the values for each duplicated ID into a comma-separated string while keeping each ID only once, like this:
ID chrom left right ref_seq var_type zygosity transcript_name
0 chr1 1590327 1590328 a SNP Hom NM_033486,NM_033487,NM_033488,NM_033489,NM_033492,NM_033493
1 chr1 1590526 1590527 g SNP Hom NM_033486,NM_033487,NM_033488,NM_033489,NM_033492
I've searched for similar questions, but the solutions I tried haven't worked so far; instead, they return a zero-row data frame.
Upvotes: 5
Views: 1808
Reputation: 37879
One way with data.table:
library(data.table)
#setDT converts the data.frame into a data.table by reference (no copy is made)
#paste(..., collapse = ', ') joins the transcript names within each group defined by the 'by' argument
setDT(df)[, list(transcript_name = paste(transcript_name, collapse = ', ')),
by = c('ID', 'chrom', 'left', 'right', 'ref_seq', 'var_type', 'zygosity')]
# ID chrom left right ref_seq var_type zygosity transcript_name
#1: 0 chr1 1590327 1590328 a SNP Hom NM_033486, NM_033487, NM_033488, NM_033489, NM_033492, NM_033493
#2: 1 chr1 1590526 1590527 g SNP Hom NM_033486, NM_033487, NM_033488, NM_033489, NM_033492
Data
df <- read.table(header = TRUE, text = 'ID chrom left right ref_seq var_type zygosity transcript_name
0 chr1 1590327 1590328 a SNP Hom NM_033486
0 chr1 1590327 1590328 a SNP Hom NM_033487
0 chr1 1590327 1590328 a SNP Hom NM_033488
0 chr1 1590327 1590328 a SNP Hom NM_033489
0 chr1 1590327 1590328 a SNP Hom NM_033492
0 chr1 1590327 1590328 a SNP Hom NM_033493
1 chr1 1590526 1590527 g SNP Hom NM_033486
1 chr1 1590526 1590527 g SNP Hom NM_033487
1 chr1 1590526 1590527 g SNP Hom NM_033488
1 chr1 1590526 1590527 g SNP Hom NM_033489
1 chr1 1590526 1590527 g SNP Hom NM_033492')
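With close to 56 columns, as described in the question, typing every grouping column by hand gets tedious. The following is a minimal sketch, assuming every column other than transcript_name is constant within an ID: it derives the grouping columns from the column names. The names `group_cols` and `collapsed` are illustrative and not part of the answer above, the data is a four-row subset of the sample, and `collapse = ","` (without a space) is used to match the output format requested in the question.

```r
library(data.table)

# Small subset of the sample data, so the block runs on its own
df <- read.table(header = TRUE, text = 'ID chrom left right ref_seq var_type zygosity transcript_name
0 chr1 1590327 1590328 a SNP Hom NM_033486
0 chr1 1590327 1590328 a SNP Hom NM_033487
1 chr1 1590526 1590527 g SNP Hom NM_033486
1 chr1 1590526 1590527 g SNP Hom NM_033487')

# Group by every column except the one being collapsed
# (assumption: the other columns do not vary within an ID)
group_cols <- setdiff(names(df), "transcript_name")

# as.data.table() makes a copy, so df itself is left unchanged
collapsed <- as.data.table(df)[
  , list(transcript_name = paste(transcript_name, collapse = ",")),
  by = group_cols
]
print(collapsed)
```

If some other column did vary within an ID, this approach would produce one row per distinct combination rather than one row per ID, so it is worth checking the row count of the result against the number of unique IDs.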
Upvotes: 4
Reputation: 3587
Another solution using base R
aggregate(transcript_name ~ ., data = df, FUN = paste, collapse = ",")
Thanks to @Sotos & @LyzandeR for collapse
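To make the one-liner easier to try, here is a small self-contained sketch of the same base R call on a four-row subset of the sample data. The name `res` is illustrative only. In the formula, the `.` on the right-hand side stands for all remaining columns, and `collapse = ","` is passed through to paste.

```r
# Small subset of the sample data, so the block runs on its own
df <- read.table(header = TRUE, text = 'ID chrom left right ref_seq var_type zygosity transcript_name
0 chr1 1590327 1590328 a SNP Hom NM_033486
0 chr1 1590327 1590328 a SNP Hom NM_033487
1 chr1 1590526 1590527 g SNP Hom NM_033486
1 chr1 1590526 1590527 g SNP Hom NM_033487')

# transcript_name ~ . : collapse transcript_name, grouped by every other column
res <- aggregate(transcript_name ~ ., data = df, FUN = paste, collapse = ",")
print(res)
```

One thing to keep in mind: the formula method of aggregate uses na.action = na.omit by default, so rows with an NA in any column are dropped before grouping. On a wide data frame with missing values this may be one reason for getting fewer rows than expected; passing na.action = na.pass is a possible workaround, though whether it applies to the data in the question is an assumption.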
Upvotes: 8