Reputation: 396
I am trying to merge duplicate rows using data.table aggregate, but I need to figure out how to concatenate the non-duplicated columns as strings in the output:
dt = data.table(
ensembl_id=c("ENSRNOG00000055068", "ENSRNOG00000055068", "ENSRNOG00000055068"),
hsapiens_ensembl_id=c("ENSG00000196262", "ENSG00000236334", "ENSG00000263353"),
chr=c(14, 14, 14),
start=c(22706901, 22706901, 22706901),
hsapiens_symbol=c("PPIA", "PPIAL4G", "PPIAL4A"),
hsapiens_chr=c(7, 1, 1)
)
dt[, lapply(.SD, paste(...,sep=",")), by=ensembl_id] # <- need magic join/paste function
desired output:
ensembl_id hsapiens_ensembl_id chr start hsapiens_symbol hsapiens_chr
1: ENSRNOG00000055068 ENSG00000196262,ENSG00000236334,ENSG00000263353 14 22706901 PPIA,PPIAL4G,PPIAL4A 7,1,1
Upvotes: 2
Views: 751
Reputation: 886938
We can use collapse
with paste
instead of sep
and include 'chr', 'start' also as grouping variables
library(data.table)
dt[, lapply(.SD, paste, collapse=","), by = .(chr, start, ensembl_id)]
Or more compactly, with toString
dt[, lapply(.SD, toString), by = .(chr, start, ensembl_id)]
If we there are duplicates, get the unique
values and paste
dt[, lapply(.SD, function(x) toString(unique(x))), by = .(chr, start, ensembl_id)]
Upvotes: 4