bab2155

Reputation: 83

Aggregating all unique values of each column of data frame

I have a large data frame (1616610 rows, 255 columns) and I need to paste together the unique values of each column based on a key.

For example:

> data = data.frame(a=c(1,1,1,2,2,3),
              b=c("apples", "oranges", "apples", "apples", "apples", "grapefruit"),
              c=c(12, 22, 22, 45, 67, 28), 
              d=c("Monday", "Monday", "Monday", "Tuesday", "Wednesday", "Tuesday"))
> data
  a          b  c         d
1 1     apples 12    Monday
2 1    oranges 22    Monday
3 1     apples 22    Monday
4 2     apples 45   Tuesday
5 2     apples 67 Wednesday
6 3 grapefruit 28   Tuesday

What I need is to collapse the unique values in each of the 255 columns, grouped by the key column a, and return a new data frame in which the unique values are separated by commas. Like this:

  a               b      c                  d
1 1 apples, oranges 12, 22             Monday
2 2          apples 45, 67 Tuesday, Wednesday
3 3      grapefruit     28            Tuesday

I have tried using aggregate, like so:

output <- aggregate(data, by=list(data$a), paste, collapse=", ")

but for a data frame this size it has been too slow (it runs for hours), and I often have to kill the process altogether. On top of that, this pastes together all values, not only the unique ones. Does anyone have any tips on:

1) how to improve the time of this aggregation for large data sets

2) then get the unique values of each field

BTW, this is my first post on SO, so thanks for your patience.

Upvotes: 8

Views: 3236

Answers (2)

steveb

Reputation: 5532

You could do the following with dplyr:

Edit 1

Updated answer that avoids the deprecation warning raised by summarise_each (deprecated as of dplyr 0.7.0). It uses summarise with across instead, which requires dplyr 1.0.0 or later.

library(dplyr)

func_paste <- function(x) paste(unique(x), collapse = ', ')
data %>%
  group_by(a) %>%
  summarise(across(everything(), func_paste))

# Without "func_paste", using paste directly (from Alistaire's comment).
data %>%
  group_by(a) %>%
  summarise(across(everything(), ~ paste(unique(.), collapse = ', ')))

## # A tibble: 3 × 4
##       a b               c      d
##   <dbl> <chr>           <chr>  <chr>
## 1     1 apples, oranges 12, 22 Monday
## 2     2 apples          45, 67 Tuesday, Wednesday
## 3     3 grapefruit      28     Tuesday

Previous answer, which raises a deprecation warning (as of dplyr 0.7.0):

func_paste <- function(x) paste(unique(x), collapse = ', ')
data %>%
    group_by(a) %>%
    summarise_each(funs(func_paste))

##      a               b      c                  d
##  (dbl)           (chr)  (chr)              (chr)
##1     1 apples, oranges 12, 22             Monday
##2     2          apples 45, 67 Tuesday, Wednesday
##3     3      grapefruit     28            Tuesday

# Without "func_paste", using paste directly (from Alistaire's comment).
data %>%
  group_by(a) %>%
  summarise_each(funs(paste(unique(.), collapse = ', ')))

Upvotes: 5

G. Grothendieck

Reputation: 269556

Moved from comments:

library(data.table)

dt <- as.data.table(data)
dt[, lapply(.SD, function(x) toString(unique(x))), by = a]

giving:

   a               b      c                  d
1: 1 apples, oranges 12, 22             Monday
2: 2          apples 45, 67 Tuesday, Wednesday
3: 3      grapefruit     28            Tuesday
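
For comparison, the aggregate call from the question can be adapted so that it keeps only the unique values. This is a minimal base R sketch on the sample data; the helper name paste_unique is made up for illustration, and I have not timed it on a data frame of the size described in the question, so treat it as a correctness check rather than a performance fix.

```r
# Sample data from the question
data <- data.frame(
  a = c(1, 1, 1, 2, 2, 3),
  b = c("apples", "oranges", "apples", "apples", "apples", "grapefruit"),
  c = c(12, 22, 22, 45, 67, 28),
  d = c("Monday", "Monday", "Monday", "Tuesday", "Wednesday", "Tuesday")
)

# Hypothetical helper (name is an assumption): collapse the distinct
# values of one column into a single comma-separated string
paste_unique <- function(x) toString(unique(x))

# Aggregate every column except the key `a`, grouped by `a`
output <- aggregate(data[-1], by = list(a = data$a), FUN = paste_unique)
output
```

Dropping the key with data[-1] and naming the grouping list avoids the extra Group.1 column that the original call produces.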

Upvotes: 7
