dragon951
dragon951

Reputation: 396

R data.table merge duplicate rows and concatenate unique values

I am trying to merge duplicate rows using data.table aggregate, but I need to figure out how to concatenate the non-duplicated columns as strings in the output:

dt = data.table(
  ensembl_id=c("ENSRNOG00000055068", "ENSRNOG00000055068", "ENSRNOG00000055068"),
  hsapiens_ensembl_id=c("ENSG00000196262", "ENSG00000236334", "ENSG00000263353"),
  chr=c(14, 14, 14),
  start=c(22706901, 22706901, 22706901),
  hsapiens_symbol=c("PPIA", "PPIAL4G", "PPIAL4A"),
  hsapiens_chr=c(7, 1, 1)
)
dt[, lapply(.SD, paste(...,sep=",")), by=ensembl_id] # <- need magic join/paste function

desired output:

           ensembl_id                             hsapiens_ensembl_id chr    start      hsapiens_symbol hsapiens_chr
1: ENSRNOG00000055068 ENSG00000196262,ENSG00000236334,ENSG00000263353  14 22706901 PPIA,PPIAL4G,PPIAL4A        7,1,1

Upvotes: 2

Views: 751

Answers (1)

akrun
akrun

Reputation: 886938

We can use collapse with paste instead of sep and include 'chr', 'start' also as grouping variables

library(data.table)
dt[, lapply(.SD, paste, collapse=","), by = .(chr, start, ensembl_id)]

Or more compactly, with toString

dt[, lapply(.SD, toString), by = .(chr, start, ensembl_id)]

If we there are duplicates, get the unique values and paste

dt[, lapply(.SD, function(x) toString(unique(x))), by = .(chr, start, ensembl_id)]

Upvotes: 4

Related Questions