Reputation: 1
I am trying to run a spline function over the rows (690075) of a dataframe (Camera1) with 4096 columns (each column represents a position on the x axis), where the input variable to the function is a column of the same length in another dataset (test$vr), using a for loop; but I am having serious computational time issues.
I have tried converting the dataframe to a matrix and storing the output in a list, among other things, but to no avail. I have to do this for two other dataframes (camera2, camera3) of the same size.
Code
# Note: camera1 and test$vr have the same number of rows
library(dplyr)  # for bind_rows()

# Initialize
final.data1 <- data.frame()
# New wavelength range
y1 <- round(seq(from = 4714, to = 4900, length.out = 4096), 3)

system.time({
  for (i in 1:690075) {
    # Doppler-shift the wavelength grid (stored in the column names) for row i
    w1 <- as.numeric(colnames(camera1)[-1]) * (1.0 + test$vr[i] / 299792.458)
    # Re-interpolate row i of camera1 onto the common grid y1
    my.data1 <- as.data.frame(t(splinefun(x = w1, y = camera1[i, ][-1])(y1)))
    colnames(my.data1) <- y1
    final.data1 <- bind_rows(final.data1, my.data1)
  }
})
I am running this on an Ubuntu box with 344 GB of RAM and a 30-core Intel(R) Xeon(R) CPU E5-2695 @ 2.30GHz.
Any suggestions would be greatly appreciated. Thank you.
Upvotes: 0
Views: 103
Reputation: 166
First, move all instructions that only need to be done once outside the for loop, for example the colnames() and as.numeric() calls.
Second, try to vectorize. The w1 calculation can be done for all rows at once outside the for loop, for example with outer(), instead of being repeated with [i] at every iteration.
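To make the vectorization concrete, here is a minimal sketch using outer(); the column names and vr values below are made up for illustration, since the real data isn't shown:

```r
# Toy stand-ins for camera1 and test: an id column plus numeric wavelength
# column names, and one radial velocity per row.
camera1 <- data.frame(id = 1:3,
                      `4714` = c(1.0, 2.0, 3.0),
                      `4715` = c(1.5, 2.5, 3.5),
                      check.names = FALSE)
test <- data.frame(vr = c(30.0, -12.5, 4.2))

base_w  <- as.numeric(colnames(camera1)[-1])  # hoisted out of the loop
doppler <- 1.0 + test$vr / 299792.458         # one Doppler factor per row

# outer() builds the full (rows x wavelengths) matrix in one vectorized call;
# row i is exactly the w1 the original loop computed on iteration i.
w1_all <- outer(doppler, base_w)
```

Inside the loop you would then index w1_all[i, ] instead of recomputing w1.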
Third, initialize final.data1 to its final dimensions. For each row added to this data.frame, R creates a new data.frame with one more row and then discards the previous one, which takes a long time. Thus: final.data1 <- matrix(NA, ncol=length(y1), nrow=NROW).
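The cost of growing versus preallocating can be seen on a scaled-down sketch (the sizes here are illustrative, not the question's 690075 x 4096):

```r
n_rows <- 1000; n_cols <- 50

# Growing a data.frame: every rbind copies all rows accumulated so far,
# so the total work is quadratic in the number of rows.
grow <- function() {
  out <- data.frame()
  for (i in 1:n_rows) out <- rbind(out, as.data.frame(t(runif(n_cols))))
  out
}

# Preallocating a matrix: one allocation, then cheap in-place row writes.
prealloc <- function() {
  out <- matrix(NA_real_, nrow = n_rows, ncol = n_cols)
  for (i in 1:n_rows) out[i, ] <- runif(n_cols)
  out
}

system.time(grow())
system.time(prealloc())
```

Even at this small size the preallocated version should be noticeably faster, and the gap widens rapidly with row count.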
And finally, if you want to use more than one core, replace the for loop with a parallelized foreach loop. This is possible if all rows are independent:
require(foreach)
require(doSNOW)
require(iterators)  # provides icount()

cl <- makeCluster(25, type="FORK")  # FORK not usable on Windows
registerDoSNOW(cl)  # register the cluster
# Only needed for non-FORK (e.g. PSOCK) clusters: export the objects used by
# each iteration, for example y1, w1 and camera1 (FORK workers inherit them):
# clusterExport(cl, c("y1", "w1", "camera1"), envir=environment())
final.data1 <- foreach(i=icount(NROW), .combine=rbind, .inorder=FALSE) %dopar%
{
  # your R code for one row
  # (use .inorder=TRUE if the row order of the result must match the input)
}
stopCluster(cl)
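As a usage illustration, here is a self-contained toy run of the same pattern; the data, cluster size and per-row work are made up (and a default PSOCK cluster is used so it also runs on Windows, which is why the export step is needed):

```r
library(foreach)
library(doSNOW)
library(iterators)

n <- 100; p <- 8
m <- matrix(runif(n * p), nrow = n)   # toy stand-in for camera1's rows

cl <- makeCluster(2)                  # PSOCK by default, works everywhere
registerDoSNOW(cl)
clusterExport(cl, "m", envir = environment())  # PSOCK workers need the data

# Each iteration processes one independent row; rbind stitches them back.
res <- foreach(i = icount(n), .combine = rbind) %dopar% {
  m[i, ] * 2                          # stand-in for the per-row spline work
}
stopCluster(cl)
```

With the default .inorder=TRUE, row i of res corresponds to row i of m, which is what you need if each row must stay aligned with test$vr[i].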
Upvotes: 3
Reputation: 76641
Without seeing the data it's not easy to optimize your code, but I would start with something along the lines of the following.
final.data1 <- matrix(nrow = 690075, ncol = 4096)
# New wavelength range
y1 <- round(seq(from = 4714, to = 4900, length.out = 4096), 3)
system.time({
  # All Doppler-shifted wavelength grids at once: row i of w1 is the grid
  # your loop computed on iteration i
  w1 <- outer(1.0 + test$vr / 299792.458, as.numeric(colnames(camera1)[-1]))
  for (i in 1:690075) {
    final.data1[i, ] <- splinefun(x = w1[i, ], y = camera1[i, ][-1])(y1)
  }
})
final.data1 <- as.data.frame(final.data1)
colnames(final.data1) <- y1
Explanation:
I start by defining an object of class matrix to hold the results. I believe I got the dimensions of your final data.frame right. This reduces the running time because:
Matrices are much faster than data frames: they are just folded vectors, so indexing is fast. Data frames, on the contrary, are lists that can hold all types of data (numeric, character, logical, other lists, etc.), and therefore accessing their members is slow.
Reserving the result's full memory in one operation saves R's memory management routines a lot of work; extending final.data1 on every iteration through the loop is very time consuming.
w1 is computed outside the loop, taking advantage of R's vectorized nature. Besides, you were repeating the computation of as.numeric(colnames(camera1[-1])) 690k times!
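The first point can be checked directly with a small timing sketch (sizes scaled down from the question's 690075 x 4096; the numbers are illustrative):

```r
n <- 2000; p <- 100
m  <- matrix(0, nrow = n, ncol = p)
df <- as.data.frame(m)
x  <- runif(p)

# The matrix loop writes into one contiguous numeric vector; the data.frame
# loop dispatches `[<-.data.frame` and touches each of the p list columns.
t_mat <- system.time(for (i in 1:n) m[i, ]  <- x)["elapsed"]
t_df  <- system.time(for (i in 1:n) df[i, ] <- x)["elapsed"]

t_mat
t_df
```

On typical hardware the data.frame loop is orders of magnitude slower, which is exactly the overhead your original bind_rows loop paid on every one of the 690075 iterations.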
Test this code and if it doesn't produce the same final result, just say so and I will see if I can do something to debug it.
Upvotes: 3