Amariz

Reputation: 43

'ddply' causes a fatal error in RStudio running correlation on a large data set: ways to optimize?

I need to calculate correlations on a large dataset (more than 1 million rows), split by several columns. I am trying to do this by combining ddply with cor():

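library(plyr)

# Returns the two correlations for one group as a one-row data frame
func <- function(xx) {
  return(data.frame(corB = cor(xx$ysales, xx$bas.sales), 
                    corA = cor(xx$ysales, xx$tysales)))
}

# Apply func to every (IBD, cell, cat) group
output <- ddply(input, .(IBD,cell,cat), func)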

This code works well on relatively small data sets (data frames with 1,000 or 10,000 rows), but causes a 'fatal error' when the input has 100,000 rows or more. So it looks like my computer does not have enough memory to process such a large file with these functions.

Is there a way to optimize this code? Perhaps some alternative to ddply is more efficient, or a loop could split the work into several consecutive steps?

Upvotes: 4

Views: 352

Answers (1)

cryo111

Reputation: 4474

I do not have any problems with ddply on my machine, even with 1e7 rows and the data given below. In total, it uses approx. 1.7 GB of memory. Here is my code:

options(stringsAsFactors=FALSE)

#this makes your code reproducible
set.seed(1234)
N_rows=1e7
input=data.frame(IBD=sample(letters[1:5],N_rows,TRUE),
                 cell=sample(letters[1:5],N_rows,TRUE),
                 cat=sample(letters[1:5],N_rows,TRUE),
                 ysales=rnorm(N_rows),
                 tysales=rnorm(N_rows),
                 bas.sales=rnorm(N_rows))

#your solution
library(plyr)

func <- function(xx) {
  return(data.frame(corB = cor(xx$ysales, xx$bas.sales), 
                    corA = cor(xx$ysales, xx$tysales)))
}

output <- ddply(input, .(IBD,cell,cat), func)
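To compare memory use on your own machine, something like the following may help. It is only a rough sketch: `demo` is a hypothetical, much smaller stand-in for `input`, and object.size() reports only the size of the object itself, not the temporary copies that a split-apply-combine step can create.

```r
# Hypothetical small stand-in for `input` (1e5 rows instead of 1e7)
set.seed(1234)
n_rows <- 1e5
demo <- data.frame(IBD = sample(letters[1:5], n_rows, TRUE),
                   cell = sample(letters[1:5], n_rows, TRUE),
                   cat = sample(letters[1:5], n_rows, TRUE),
                   ysales = rnorm(n_rows),
                   tysales = rnorm(n_rows),
                   bas.sales = rnorm(n_rows),
                   stringsAsFactors = FALSE)

# Size of the data frame itself
size_bytes <- object.size(demo)
print(format(size_bytes, units = "MB"))

# gc() runs a garbage collection and returns a matrix of the memory
# currently used by the R session
mem <- gc()
print(mem)
```

Calling gc() before and after the ddply step gives a rough idea of how much extra memory the grouping needs.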

However, if your problem is more complex than my sample data, you could try the data.table package. Here is some code (please note that I am not a heavy user of data.table, so the code below might be inefficient):

library(data.table)

input_dt=data.table(input)

output_dt=unique(input_dt[,`:=`(corB=cor(.SD$ysales,.SD$bas.sales),
                                corA=cor(.SD$ysales,.SD$tysales))
                          ,by=c('IBD','cell','cat')]
                 [,c('IBD','cell','cat','corB','corA'),with=FALSE])

output_dt=output_dt[order(output_dt$IBD,output_dt$cell,output_dt$cat)]

It gives the same result:

all.equal(data.table(output),output_dt)
#[1] TRUE

head(output_dt,3)

#   IBD cell cat          corB          corA
#1:   a    a   a -6.656740e-03 -0.0050483282
#2:   a    a   b  4.758460e-03  0.0051115833
#3:   a    a   c  1.751167e-03  0.0036150088
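A possibly more idiomatic data.table variant is to aggregate directly in j instead of adding columns with `:=` and then calling unique(). This is a sketch, not a benchmarked solution: `input_small` and `output_agg` are hypothetical names, and the data are a smaller version of the sample above. Unlike `:=`, this form does not add columns to the input table by reference, and `keyby` returns the groups already sorted.

```r
library(data.table)

# Hypothetical smaller sample with the same columns as `input`
set.seed(1234)
n_rows <- 1e5
input_small <- data.table(IBD = sample(letters[1:5], n_rows, TRUE),
                          cell = sample(letters[1:5], n_rows, TRUE),
                          cat = sample(letters[1:5], n_rows, TRUE),
                          ysales = rnorm(n_rows),
                          tysales = rnorm(n_rows),
                          bas.sales = rnorm(n_rows))

# One row per (IBD, cell, cat) group; keyby sorts the result by the groups
output_agg <- input_small[, .(corB = cor(ysales, bas.sales),
                              corA = cor(ysales, tysales)),
                          keyby = .(IBD, cell, cat)]

head(output_agg, 3)
```

Because each group is reduced to a single row straight away, the intermediate table with one corB/corA value per input row is never built.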

Upvotes: 1
