Compress the data frame by removing duplicate columns while keeping the extra corresponding information

Question

I find it difficult to describe my problem clearly, so here is an example of what I want to do.

I have a dataframe:

a = data.frame(gene = c("A", "A", "A", "B", "B", "C"), 
              id = c(100, 100, 30, 250, 250, 600),
              where = c("human", "flow", "apple", "human", "rock", "ghost"))

I want to collapse the duplicated rows into one row per gene, while keeping the distinct values from the other columns, and get output like this:

  gene  id       where
  A     100, 30  human, flow, apple
  B     250      human, rock
  C     600      ghost

Thanks a lot for your help.

www · Accepted Answer

A solution using dplyr.

library(dplyr)

a2 <- a %>%
  group_by(gene) %>%
  summarize_all(list(~toString(unique(.))))
a2
# # A tibble: 3 x 3
#   gene  id      where             
#   <chr> <chr>   <chr>             
# 1 A     100, 30 human, flow, apple
# 2 B     250     human, rock       
# 3 C     600     ghost
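
In dplyr 1.0.0 and later, summarize_all() is superseded by across(), so the same grouping step can be written as below. This is a minimal sketch that rebuilds the data frame a from the question; the name a3 is only illustrative.

```r
library(dplyr)

# Same example data as in the question
a <- data.frame(gene = c("A", "A", "A", "B", "B", "C"),
                id = c(100, 100, 30, 250, 250, 600),
                where = c("human", "flow", "apple", "human", "rock", "ghost"))

a3 <- a %>%
  group_by(gene) %>%
  # across(everything()) covers every non-grouping column (id and where);
  # each group's distinct values are collapsed into one comma-separated string
  summarise(across(everything(), ~ toString(unique(.x))))
a3
```

Because toString() returns text, id comes back as a character column rather than a numeric one.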

Or use data.table.

library(data.table)

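# Note: setDT() converts `a` to a data.table by reference (in place);
# use as.data.table(a) instead if `a` should remain a plain data frame.
setDT(a)[, lapply(.SD, function(x) toString(unique(x))), by = gene][]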
#    gene      id              where
# 1:    A 100, 30 human, flow, apple
# 2:    B     250        human, rock
# 3:    C     600              ghost

Or base R.

aggregate(x = a[, !names(a) %in% "gene"], by = a[, "gene", drop = FALSE], 
          function(x) toString(unique(x)))
#   gene      id              where
# 1    A 100, 30 human, flow, apple
# 2    B     250        human, rock
# 3    C     600              ghost
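
All three approaches apply the same building block, toString(unique(x)), to each group. A small sketch of what it does to the id values of gene A (the variable names ids_for_A and collapsed are only illustrative):

```r
# The id values that belong to gene "A" in the example data
ids_for_A <- c(100, 100, 30)

# unique() drops the repeated 100 and keeps values in first-seen order
distinct_ids <- unique(ids_for_A)

# toString() pastes the remaining values together, separated by ", "
collapsed <- toString(distinct_ids)
collapsed                 # "100, 30"

# The result is a single character string, no longer a numeric vector
is.character(collapsed)   # TRUE
```

If the individual values are needed later, they have to be split out of the string again, for example with strsplit(collapsed, ", ").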
