How to use dplyr as alternative to aggregate

Question

I have a dataframe times that looks like this:

user     time
A        7/7/2010
B        7/12/2010
C        7/12/2010
A        7/12/2010 
C        7/15/2010

I'm using aggregate(time ~ user, times, function(x) sort(as.vector(x))) to get this:

user     time
A        c(7/7/2010, 7/12/2010)
B        c(7/12/2010)
C        c(7/12/2010, 7/15/2010)

The problem is that I have over 20 million entries in times so aggregate is taking a over 4 hours. Is there any alternative using dplyr that will get me the sorted vector of dates?

eipi10 · Accepted Answer

Updated Answer: Based on your comment, how about this:

library(dplyr)

# Data (with a few additions)
times = read.table(text="user     time
A        7/7/2010
B        7/12/2010
B 7/13/2010
C        7/12/2010
A        7/12/2010 
A 7/13/2010
C        7/15/2010", header=TRUE, stringsAsFactors=FALSE)

times$time = as.Date(times$time, "%m/%d/%Y")

times

  user       time
1    A 2010-07-07
2    B 2010-07-12
3    B 2010-07-13
4    C 2010-07-12
5    A 2010-07-12
6    A 2010-07-13
7    C 2010-07-15

times %>% group_by(user) %>%
  summarise(First=min(time),
            Last=max(time),
            N = n(),
            minDiff=min(diff(time)),
            meanDiff=mean(diff(time)),
            NumDiffUniq = length(unique(diff(time))))

   user      First       Last     N        minDiff       meanDiff NumDiffUniq
1     A 2010-07-07 2010-07-13     3         1 days         3 days           2
2     B 2010-07-12 2010-07-13     2         1 days         1 days           1
3     C 2010-07-12 2010-07-15     2         3 days         3 days           1

Original Answer:

I'm not clear on what you're trying to accomplish. If you just want your data frame to be sorted, then with dplyr you would do:

library(dplyr)

times.sorted = times %>% arrange(user, time)

If you want time to become a string of dates for each user, then you could do:

times.summary = times %>% group_by(user) %>%
  summarise(time = paste(time, collapse=","))

But note that for each user this will result in a single string containing the dates.

times.summary

   user                time
1     A  7/7/2010,7/12/2010
2     B           7/12/2010
3     C 7/12/2010,7/15/2010

If you actually want each cell to be a vector of dates, you could make each cell a list (though there might be a better way). For example:

times.new = times %>% group_by(user) %>%
  summarise(time = list(as.vector(time)))

times.new$time

[[1]]
[1] "7/7/2010"  "7/12/2010"

[[2]]
[1] "7/12/2010"

[[3]]
[1] "7/12/2010" "7/15/2010"

But if your goal is to analyze your data by group, then you don't actually need to do any of the above. You can use base, dplyr, or data.table functions to perform any analysis by group without first sorting your data.

How to use dplyr as alternative to aggregate

Answers (2)

Related Questions