Reputation: 29
I have a data frame times that looks like this:
user time
A 7/7/2010
B 7/12/2010
C 7/12/2010
A 7/12/2010
C 7/15/2010
I'm using aggregate(time ~ user, times, function(x) sort(as.vector(x))) to get this:
user time
A c(7/7/2010, 7/12/2010)
B c(7/12/2010)
C c(7/12/2010, 7/15/2010)
The problem is that I have over 20 million entries in times, so aggregate is taking over 4 hours. Is there an alternative using dplyr that will get me the sorted vector of dates?
Upvotes: 1
Views: 3995
Reputation: 29
Based on the dplyr solution by eipi10 and the suggestion of nrussell, I've written the following solution using data.table.
First you need to convert the time column of times to the Date class:
times$time = as.Date(times$time, "%m/%d/%Y")
Then you'll need to convert times to a data.table using:
library(data.table)
times <- as.data.table(times)
Overwriting times was useful for my purposes, but you may want to assign the result to a new variable instead. After converting your data frame to a data.table, just do:
# Summarise each user's dates; by = must sit outside the .() call
new.times <- times[,
  .(first = min(time),
    last = max(time),
    n = .N,
    meandiff = mean(diff(time)),
    mindiff = min(diff(time)),
    numdiffuniq = length(unique(diff(time)))),
  by = 'user']
Running on a Linux virtual machine with 128 GB of RAM and using a sample of 1000 entries, the elapsed runtime was 0.43s.
See this tutorial for more on data.table.
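The summary above replaces the sorted vector with statistics. If the sorted dates themselves are still wanted, as in the original question, a list column is one way to keep them. This is only a sketch on the toy data from the question; the names sorted.times and dates are made up for illustration, and I haven't timed it on 20 million rows.

```r
library(data.table)

# Toy data matching the question's example (not the real 20M-row table)
times <- data.table(
  user = c("A", "B", "C", "A", "C"),
  time = as.Date(c("7/7/2010", "7/12/2010", "7/12/2010",
                   "7/12/2010", "7/15/2010"), "%m/%d/%Y")
)

# One row per user; wrapping in list() makes each cell of `dates`
# hold that user's full sorted Date vector
sorted.times <- times[, .(dates = list(sort(time))), by = user]

# Pull out the sorted dates for the first user
sorted.times$dates[[1]]
```

Each element of sorted.times$dates is an ordinary Date vector, so functions like diff or range can be applied to it afterwards.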
Upvotes: 0
Reputation: 93811
Updated Answer: Based on your comment, how about this:
library(dplyr)
# Data (with a few additions)
times = read.table(text="user time
A 7/7/2010
B 7/12/2010
B 7/13/2010
C 7/12/2010
A 7/12/2010
A 7/13/2010
C 7/15/2010", header=TRUE, stringsAsFactors=FALSE)
times$time = as.Date(times$time, "%m/%d/%Y")
times
  user       time
1    A 2010-07-07
2    B 2010-07-12
3    B 2010-07-13
4    C 2010-07-12
5    A 2010-07-12
6    A 2010-07-13
7    C 2010-07-15
times %>% group_by(user) %>%
summarise(First=min(time),
Last=max(time),
N = n(),
minDiff=min(diff(time)),
meanDiff=mean(diff(time)),
NumDiffUniq = length(unique(diff(time))))
  user      First       Last N minDiff meanDiff NumDiffUniq
1    A 2010-07-07 2010-07-13 3  1 days   3 days           2
2    B 2010-07-12 2010-07-13 2  1 days   1 days           1
3    C 2010-07-12 2010-07-15 2  3 days   3 days           1
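One caveat worth checking: diff() works on consecutive rows, so the gap columns only mean "days between successive visits" if each user's dates are in chronological order within the group. The example data happens to be ordered that way. If yours might not be, a possible variant is to arrange first. The sketch below uses a small made-up data frame with one user whose rows are deliberately out of order.

```r
library(dplyr)

# Hypothetical data: user A's dates are not in chronological order
times <- data.frame(
  user = c("A", "A", "A"),
  time = as.Date(c("2010-07-12", "2010-07-07", "2010-07-13")),
  stringsAsFactors = FALSE
)

# Sort within user before taking differences, so diff() sees
# 2010-07-07, 2010-07-12, 2010-07-13 in that order
gaps <- times %>%
  arrange(user, time) %>%
  group_by(user) %>%
  summarise(minDiff = min(diff(time)),
            meanDiff = mean(diff(time)))
gaps
```

Without the arrange step, the first difference for this user would be negative (2010-07-07 minus 2010-07-12), which would change both minDiff and meanDiff.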
Original Answer:
I'm not clear on what you're trying to accomplish. If you just want your data frame to be sorted, then with dplyr you would do:
library(dplyr)
times.sorted = times %>% arrange(user, time)
If you want time to become a string of dates for each user, then you could do:
times.summary = times %>% group_by(user) %>%
summarise(time = paste(time, collapse=","))
But note that for each user this will result in a single string containing the dates.
times.summary
  user                time
1    A  7/7/2010,7/12/2010
2    B           7/12/2010
3    C 7/12/2010,7/15/2010
If you actually want each cell to be a vector of dates, you could make each cell a list (though there might be a better way). For example:
times.new = times %>% group_by(user) %>%
summarise(time = list(as.vector(time)))
times.new$time
[[1]]
[1] "7/7/2010"  "7/12/2010"

[[2]]
[1] "7/12/2010"

[[3]]
[1] "7/12/2010" "7/15/2010"
But if your goal is to analyze your data by group, then you don't actually need to do any of the above. You can use base R, dplyr, or data.table functions to perform any analysis by group without first sorting your data.
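As a rough illustration of the base R route, here is a sketch using split() and lapply() on the toy data from the question. The names by.user, sorted.by.user and n.per.user are just placeholders, and I have not benchmarked this against aggregate on a large data set.

```r
# Toy data matching the question's example
times <- data.frame(
  user = c("A", "B", "C", "A", "C"),
  time = as.Date(c("7/7/2010", "7/12/2010", "7/12/2010",
                   "7/12/2010", "7/15/2010"), "%m/%d/%Y"),
  stringsAsFactors = FALSE
)

# split() returns a named list with one Date vector per user
by.user <- split(times$time, times$user)

# Sort each user's dates; the result is still a named list of Date vectors
sorted.by.user <- lapply(by.user, sort)

# Any per-group statistic can then be computed from the list,
# for example the number of entries per user
n.per.user <- vapply(sorted.by.user, length, integer(1))

sorted.by.user$A
n.per.user
```

The list returned by split() is keyed by user, so sorted.by.user[["C"]] gives that user's dates directly.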
Upvotes: 3