Reputation: 2444
I have the following, somewhat large dataset:
> dim(dset)
[1] 422105 25
> class(dset)
[1] "data.frame"
>
Without doing anything, the R process seems to take about 1GB of RAM.
I am trying to run the following code:
dset <- ddply(dset, .(tic), transform,
              date.min = min(date),
              date.max = max(date),
              daterange = max(date) - min(date),
              .parallel = TRUE)
When I run that code, RAM usage skyrockets: it completely saturates 60GB of RAM on a 32-core machine. What am I doing wrong?
Upvotes: 9
Views: 883
Reputation: 433
Are there a large number of factor levels in the data frame? I've found that this kind of excessive memory usage is common with adply and possibly other plyr functions, but that it can be remedied by removing unnecessary factors and levels. If the large data frame was read into R, make sure stringsAsFactors is set to FALSE in the import:
dat = read.csv(header=TRUE, sep="\t", file="dat.tsv", stringsAsFactors=FALSE)
Then assign the factors you actually need.
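A minimal sketch of that follow-up step, assuming the grouping column is called tic (the column name is just an example):
# Convert only the column(s) you actually need to factors:
dat$tic <- factor(dat$tic)
# If factors are already present (e.g. after subsetting), drop unused
# levels before handing the data to plyr:
dat <- droplevels(dat)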
I haven't looked into Hadley's source yet to discover why.
Upvotes: 1
Reputation: 162401
Here's an alternative application of data.table to the problem, illustrating how blazing-fast it can be. (Note: this uses dset, the data.frame constructed by Brian Diggs in his answer, except with 30000 rather than 10 levels of tic.)
(The reason this is much faster than @joran's solution is that it avoids the use of .SD, instead using the columns directly. The style is a bit different from plyr, but it typically buys huge speed-ups. For another example, see the data.table wiki, which (a) includes this as recommendation #1, and (b) shows a 50X speedup for code that drops the .SD.)
library(data.table)
system.time({
    dt <- data.table(dset, key = "tic")
    # Summarize by groups and store results in a summary data.table
    sumdt <- dt[, list(min.date = min(date), max.date = max(date)), by = "tic"]
    sumdt[, daterange := max.date - min.date]
    # Merge the summary data.table back into dt, based on key
    dt <- dt[sumdt]
})
# ELAPSED TIME IN SECONDS
#    user  system elapsed
#    1.45    0.25    1.77
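For comparison, the same per-row columns can also be added in a single grouped := assignment, without the separate summary table and merge. This is a sketch based on standard data.table usage rather than part of the answer above; dt2 is just a fresh name, and it assumes the same dummy dset with date and tic columns:
library(data.table)
dt2 <- data.table(dset)
# Add the three derived columns by reference, computed per group:
dt2[, `:=`(min.date  = min(date),
           max.date  = max(date),
           daterange = max(date) - min(date)),
    by = tic]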
Upvotes: 10
Reputation: 173657
If performance is an issue, it might be a good idea to switch to using data.tables from the package of the same name. They are fast. You'd do something roughly equivalent to this:
library(data.table)
library(plyr)  # for mutate()

dat <- data.frame(x = runif(100),
                  dt = seq.Date(as.Date('2010-01-01'), as.Date('2011-01-01'), length.out = 100),
                  grp = rep(letters[1:4], each = 25))

dt <- as.data.table(dat)
setkey(dt, grp)

dt[, mutate(.SD, date.min = min(dt),
                 date.max = max(dt),
                 daterange = max(dt) - min(dt)), by = grp]
Upvotes: 12
Reputation: 58845
A couple of things come to mind.
First, I would write it as:
dset <- ddply(dset, .(tic), summarise,
              date.min = min(date),
              date.max = max(date),
              daterange = max(date) - min(date),
              .parallel = TRUE)
Well, actually, I would probably avoid double calculating min/max date and write
dset <- ddply(dset, .(tic), function(DF) {
              mutate(summarise(DF, date.min = min(date),
                                   date.max = max(date)),
                     daterange = date.max - date.min)},
              .parallel = TRUE)
but that's not the main point you are asking about.
With a dummy data set of your dimensions
n <- 422105
dset <- data.frame(date = as.Date("2000-01-01") + sample(3650, n, replace = TRUE),
                   tic = factor(sample(10, n, replace = TRUE)))
for (i in 3:25) {
    dset[i] <- rnorm(n)
}
this ran comfortably (in under a minute) on my laptop. In fact, the plyr step took less time than creating the dummy data set, so it couldn't have been swapping on the scale you saw.
A second possibility is that there are a large number of unique values of tic; that could increase the memory needed. However, when I tried increasing the possible number of unique tic values to 1000, it didn't really slow down.
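Concretely, that tweak just changes the tic column of the dummy data above, e.g.:
dset <- data.frame(date = as.Date("2000-01-01") + sample(3650, n, replace = TRUE),
                   tic = factor(sample(1000, n, replace = TRUE)))  # 1000 possible tic values instead of 10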
Finally, it could be something in the parallelization. I don't have a parallel backend registered for foreach, so it was just doing a serial approach. Perhaps that is causing your memory explosion.
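For reference, registering a backend and running the same call would look roughly like this; the choice of doParallel and the worker count are assumptions for illustration, not something taken from the question:
library(plyr)
library(doParallel)

cl <- makeCluster(4)   # worker count chosen arbitrarily for illustration
registerDoParallel(cl)

# With a PSOCK cluster, each worker receives its own copy of the chunks
# it processes, which can multiply memory use on large data.
dset <- ddply(dset, .(tic), summarise,
              date.min = min(date),
              date.max = max(date),
              daterange = max(date) - min(date),
              .parallel = TRUE)

stopCluster(cl)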
Upvotes: 4