Reputation: 43
I need to calculate correlations on a large dataset (more than 1 million rows), split by several columns. I am trying to do this by combining the ddply and cor() functions:
library(plyr)  # provides ddply
func <- function(xx) {
  return(data.frame(corB = cor(xx$ysales, xx$bas.sales),
                    corA = cor(xx$ysales, xx$tysales)))
}
output <- ddply(input, .(IBD,cell,cat), func)
This code works well on relatively small data sets (data frames with 1,000 or 10,000 rows), but causes a 'fatal error' when the input has 100,000 rows or more. So it looks like my computer does not have enough memory to process such a large input with these functions.
Is there a way to optimize this code? Perhaps an alternative to ddply would be more efficient, or a loop could split the work into several consecutive steps?
Upvotes: 4
Views: 352
Reputation: 4474
I do not have any problems with ddply on my machine, even with 1e7 rows and the data given below. In total, it uses approximately 1.7 GB of memory.
Here is my code:
options(stringsAsFactors=FALSE)
#this makes your code reproducible
set.seed(1234)
N_rows=1e7
input=data.frame(IBD=sample(letters[1:5],N_rows,TRUE),
                 cell=sample(letters[1:5],N_rows,TRUE),
                 cat=sample(letters[1:5],N_rows,TRUE),
                 ysales=rnorm(N_rows),
                 tysales=rnorm(N_rows),
                 bas.sales=rnorm(N_rows))
#your solution
library(plyr)
func <- function(xx) {
  return(data.frame(corB = cor(xx$ysales, xx$bas.sales),
                    corA = cor(xx$ysales, xx$tysales)))
}
output <- ddply(input, .(IBD,cell,cat), func)
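If you want to compare memory use on your own machine, a rough check could look like the sketch below. It is only an illustration: the data frame d and its columns are made up and much smaller than the example above, and object.size() reports the size of the object itself, not the peak memory that ddply needs while it runs.

```r
# Illustrative only: a hypothetical data frame, far smaller than the 1e7-row example
set.seed(1234)
n <- 1e5
d <- data.frame(g = sample(letters[1:5], n, TRUE),
                x = rnorm(n),
                y = rnorm(n))

# Size of the data frame itself, printed in megabytes
print(object.size(d), units = "MB")

# Runs a garbage collection and reports the memory R currently uses
gc()
```

Running gc() before and after the ddply call gives a rough idea of how much memory the result and any leftovers occupy.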
However, in case your problem is more complex than my sample data, you could try the data.table package. Here is some code (please note that I am not a heavy user of data.table, so the code below might be inefficient):
library(data.table)
input_dt=data.table(input)
output_dt=unique(input_dt[,`:=`(corB=cor(.SD$ysales,.SD$bas.sales),
                                corA=cor(.SD$ysales,.SD$tysales))
                          ,by=c('IBD','cell','cat')]
                 [,c('IBD','cell','cat','corB','corA'),with=FALSE])
output_dt=output_dt[order(output_dt$IBD,output_dt$cell,output_dt$cat)]
It gives the same result as ddply:
all.equal(data.table(output),output_dt)
#[1] TRUE
head(output_dt,3)
#   IBD cell cat         corB          corA
#1:   a    a   a -6.656740e-03 -0.0050483282
#2:   a    a   b  4.758460e-03  0.0051115833
#3:   a    a   c  1.751167e-03  0.0036150088
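As I said, the data.table code above is probably not the most efficient form: it first writes the correlations into every row with := and then removes duplicates. A more direct variant would aggregate straight to one row per group. The following is a minimal sketch of that idea on a smaller synthetic table (the size n and the name output_agg are just for illustration); I have not benchmarked it against the version above.

```r
library(data.table)

# Smaller synthetic input with the same column names as above (size is illustrative)
set.seed(1234)
n <- 1e5
input_dt <- data.table(IBD = sample(letters[1:5], n, TRUE),
                       cell = sample(letters[1:5], n, TRUE),
                       cat = sample(letters[1:5], n, TRUE),
                       ysales = rnorm(n),
                       tysales = rnorm(n),
                       bas.sales = rnorm(n))

# Compute both correlations once per group, returning one row per group.
# keyby groups by the listed columns and sorts the result by them.
output_agg <- input_dt[, .(corB = cor(ysales, bas.sales),
                           corA = cor(ysales, tysales)),
                       keyby = .(IBD, cell, cat)]

head(output_agg, 3)
```

Because the aggregation never adds columns to the full table, it should avoid the intermediate copy of corB and corA across all rows and the unique() step afterwards.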
Upvotes: 1