Reputation: 20446
I have a data.frame (link to file) with 18 columns and 11520 rows that I transform like this:
library(plyr)
df.median<-ddply(data, .(groupname,starttime,fPhase,fCycle),
numcolwise(median), na.rm=TRUE)
According to system.time(), it takes about this long to run:
user system elapsed
5.16 0.00 5.17
This call is part of a webapp, so run time is pretty important. Is there a way to speed this call up?
Upvotes: 8
Views: 2371
Reputation: 103938
Working with this data is considerably faster with dplyr:
library(dplyr)
system.time({
data %>%
group_by(groupname, starttime, fPhase, fCycle) %>%
summarise_each(funs(median(., na.rm = TRUE)), inadist:larct)
})
#> user system elapsed
#> 0.391 0.004 0.395
(You'll need dplyr 0.2 to get %>% and summarise_each.)
This compares favourably to plyr:
library(plyr)
system.time({
df.median <- ddply(data, .(groupname, starttime, fPhase, fCycle),
numcolwise(median), na.rm = TRUE)
})
#> user system elapsed
#> 0.991 0.004 0.996
And to aggregate() (code from @joshua-ulrich):
groupVars <- c("groupname", "starttime", "fPhase", "fCycle")
dataVars <- colnames(data)[ !(colnames(data) %in% c("location", groupVars))]
system.time({
ag.median <- aggregate(data[,dataVars], data[,groupVars], median)
})
#> user system elapsed
#> 0.532 0.005 0.537
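In later dplyr releases, summarise_each() and funs() were deprecated in favour of across() (dplyr 1.0.0 and newer). The following is only a minimal sketch of the same grouped-median idea in that newer style. The toy data frame and its column names are hypothetical stand-ins, not the asker's file, and no timings are claimed for it.

```r
library(dplyr)

# Hypothetical stand-in for the real data; column names are made up.
toy <- data.frame(
  groupname = rep(c("a", "b"), each = 6),
  fPhase    = rep(c("light", "dark"), times = 6),
  x         = c(1, 2, 3, NA, 5, 6, 7, 8, 9, 10, 11, 12),
  y         = seq(0.5, 6, by = 0.5)
)

# Median of every numeric column within each group, ignoring NAs.
# Grouping columns are excluded from across() automatically.
toy.median <- toy %>%
  group_by(groupname, fPhase) %>%
  summarise(across(where(is.numeric), ~ median(.x, na.rm = TRUE)),
            .groups = "drop")

print(toy.median)
```

With the real data, the column range from the answer (inadist:larct) could be passed to across() in place of where(is.numeric).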
Upvotes: 3
Reputation: 121127
The order of the data matters when you are calculating medians: if the data are already sorted from smallest to largest, the calculation is a bit quicker.
x <- 1:1e6
y <- sample(x)
system.time(for(i in 1:1e2) median(x))
user system elapsed
3.47 0.33 3.80
system.time(for(i in 1:1e2) median(y))
user system elapsed
5.03 0.26 5.29
For new datasets, sort the data by an appropriate column when you import them. For existing datasets, you can sort them in a batch job (outside the web app).
Upvotes: 4
Reputation: 8533
To add to Joshua's solution: if you decide to use the mean instead of the median, you can speed up the computation by roughly another factor of four:
> system.time(ag.median <- aggregate(data[,dataVars], data[,groupVars], median))
user system elapsed
3.472 0.020 3.615
> system.time(ag.mean <- aggregate(data[,dataVars], data[,groupVars], mean))
user system elapsed
0.936 0.008 1.006
Upvotes: 3
Reputation: 70068
I just did a few simple transformations on a large data frame (the baseball data set in the plyr package) using the standard library functions (e.g., 'table', 'tapply', 'aggregate') and the analogous plyr functions. In each instance, I found plyr to be significantly slower. E.g.,
> system.time(table(BB$year))
user system elapsed
0.007 0.002 0.009
> system.time(ddply(BB, .(year), 'nrow'))
user system elapsed
0.183 0.005 0.189
Second, although I did not investigate whether it would improve performance in your case, for data frames of the size you are working with now and larger I use the data.table package, available on CRAN. It is simple to create data.table objects and to convert existing data.frames to data.tables: just call data.table on the data.frame you want to convert:
dt1 = data.table(my_dataframe)
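As a rough illustration of how the grouped median from the question might look once the data are in a data.table, here is a minimal sketch. The toy data and column names are hypothetical, and this answer did not benchmark it, so treat it as a starting point rather than a measured improvement.

```r
library(data.table)

# Hypothetical stand-in for the real data; column names are made up.
toy <- data.frame(
  groupname = rep(c("a", "b"), each = 6),
  fPhase    = rep(c("light", "dark"), times = 6),
  x         = c(1, 2, 3, NA, 5, 6, 7, 8, 9, 10, 11, 12),
  y         = seq(0.5, 6, by = 0.5)
)

dt <- as.data.table(toy)

groupVars <- c("groupname", "fPhase")
dataVars  <- setdiff(names(dt), groupVars)

# .SD holds the columns named in .SDcols for each group;
# lapply() applies median() to each of them, ignoring NAs.
# The value columns are doubles, so median() returns the same
# type in every group.
dt.median <- dt[, lapply(.SD, median, na.rm = TRUE),
                by = groupVars, .SDcols = dataVars]

print(dt.median)
```

The result has one row per combination of the grouping columns, in order of first appearance rather than sorted order.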
Upvotes: 2
Reputation: 176688
Just using aggregate is quite a bit faster...
> groupVars <- c("groupname","starttime","fPhase","fCycle")
> dataVars <- colnames(data)[ !(colnames(data) %in% c("location",groupVars)) ]
>
> system.time(ag.median <- aggregate(data[,dataVars], data[,groupVars], median))
user system elapsed
1.89 0.00 1.89
> system.time(df.median <- ddply(data, .(groupname,starttime,fPhase,fCycle), numcolwise(median), na.rm=TRUE))
user system elapsed
5.06 0.00 5.06
>
> ag.median <- ag.median[ do.call(order, ag.median[,groupVars]), colnames(df.median)]
> rownames(ag.median) <- 1:NROW(ag.median)
>
> identical(ag.median, df.median)
[1] TRUE
Upvotes: 9
Reputation: 100194
Just to summarize some of the points from the comments:
plyr is designed primarily for ease of use, not for performance (although the recent version had some nice performance improvements). Some of the base functions are faster because they have less overhead. @JDLong pointed to a nice thread that covers some of these issues, including some specialized techniques from Hadley.
Upvotes: 7