theomega

Reputation: 32051

Faster alternative to doBy/summaryBy

I currently use the summaryBy command from the doBy package to summarize groups of rows in a data frame with specific functions. This works fine. BUT:

The doBy package loads very slowly, I think because it imports various other packages. It takes about 3 seconds until doBy is loaded. I only need the simple summaryBy feature from this package.

Is there a way to speed up the loading time of the package, or is there an alternative implementation that does not load such a huge package?

Upvotes: 3

Views: 4605

Answers (3)

G. Grothendieck

Reputation: 269885

1) Rather than installing the doBy package, try sourcing summaryBy.R and orderBy.R from the doBy source package:

setwd("doBy/R")
source("summaryBy.R")
source("orderBy.R")

summaryBy(...whatever...)

or

2) Remove all files in the package except the DESCRIPTION file, the R directory and those two source files (remove all other .R files), remove the Depends: and Imports: lines from the DESCRIPTION file (optionally change the Package: line in the DESCRIPTION to some other name), and then rebuild and install the new stripped-down package. (Another possibility is to leave all the files in the package and just delete the Depends: and Imports: lines from the DESCRIPTION file, but that won't load quite as fast as removing nearly everything.)

Upvotes: 3

IRTFM

Reputation: 263441

You may get more rapid performance by just using the base-R lapply(split(.)) paradigm with the functions you want.

 dat <- structure(list(category = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("A", 
"B", "C"), class = "factor"), year = c(2000, 2001, 2004, 2005, 
2009, 2010, 2000, 2001, 2004, 2005, 2009, 2010, 2000, 2001, 2004, 
2005, 2009, 2010), incidents = c(7, 4, 4, 2, 3, 1, 6, 3, 5, 2, 
2, 5, 2, 1, 4, 4, 2, 1)), .Names = c("category", "year", "incidents"
), row.names = c(NA, -18L), class = "data.frame")

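# split() returns a list with one data frame per category
split(dat, dat$category)
# drop the grouping column, then summarize each group's remaining columns
lapply(split(dat[-1], dat$category), summary)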
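If the goal is a table of per-group statistics similar to what summaryBy returns, base R's aggregate() is another option that needs no extra packages. The following is only a sketch: it rebuilds the category and incidents columns from the data above in compact form, and the choice of mean and sd as the statistics is an assumption made for illustration.

```r
# Same category and incidents values as in the example data above
dat <- data.frame(
  category  = rep(c("A", "B", "C"), each = 6),
  incidents = c(7, 4, 4, 2, 3, 1, 6, 3, 5, 2, 2, 5, 2, 1, 4, 4, 2, 1)
)

# Compute mean and sd of incidents within each category
res <- aggregate(incidents ~ category, data = dat,
                 FUN = function(x) c(mean = mean(x), sd = sd(x)))

# aggregate() stores the two statistics in a single matrix column;
# do.call(data.frame, ...) flattens it into ordinary columns
res <- do.call(data.frame, res)
res
```

The result has one row per category, with the statistics in columns named incidents.mean and incidents.sd.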

Upvotes: 6

Zach

Reputation: 30321

For aggregating large datasets with complicated functions, it's hard to beat the data.table package. For example, here's how you would summarize mean and sd of Sepal.Length for the iris dataset:

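library(data.table)
# convert the built-in iris data frame to a data.table
dat <- data.table(iris)
# one row per Species with the mean and sd of Sepal.Length
dat[, list(mean = mean(Sepal.Length), sd = sd(Sepal.Length)), by = Species]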

The package loads quickly, it only takes 1 line of code (2 if you count converting your data.frame to a data.table), and it's very fast. What more could you want?

Upvotes: 15
