charliealpha

Reputation: 307

R, improve loop, make embarrassingly parallel?

I'm relatively new to R, and my code is very, very slow. I have started looking into ideas like vectorization and embarrassingly parallel computation, but I need help applying them. Here is the code, followed by my understanding of the problem:

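p  <- numeric(2)              # one probability per row of output
cp <- matrix(0, 3, 1)         # one result per row of intv

for (k in 1:3) {
    for (i in 1:2) {
        # share of values in row i that are >= the k-th threshold
        p[i] <- sum(output[i, 1:3] >= intv[k, 1]) / 200
    }
    cp[k, 1] <- crossprod(port, p)
}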

Sample data:
intv<-array(c(1,5,15),c(3,1))
output<-array(c(5,10,15,20,25,30),c(2,3))
port<-array(c(1,2),c(2,1))  # one weight per row of output, so crossprod(port,p) is conformable


output is 16,384 rows by 200 columns in real data set
intv is 16,384 rows in real data set

Essentially, this picks up a value from intv (which has 16,384 distinct values), then goes over each row in output to count the columns whose values are greater than or equal to that value. Then it does the same with the next intv value, and so on, until many, many, many hours have gone by.

Now here is my understanding of the problem:

I appreciate I have to pick up a value from intv from first row. But I don't know why I should go over each row in output sequentially to find the number of columns greater than this value.

UPDATE: I tried replacing the for loops with lapply, but the file size was too big, even on AWS. I went back to the for loops, and the run took about 3.5 hours. I would really appreciate any ideas to speed this up.

Thank you!

Changing to a matrix helped a lot (matrix version first, original version second):

> system.time({for (i in 1:nrow(facnahum)) {
+ probm[i,1]<- sum(outputm[i,1:200]>=intvm[k,1])/200
+ 
+ }
+ })
   user  system elapsed 
   0.55    0.00    0.54 
> 
> 
> system.time({for (i in 1:length(facnahu$MDR)) {
+ prob[i]<- sum(output[i,1:200]>=intv[k,1])/200
+ 
+ }
+ })
   user  system elapsed 
   1.62    0.00    1.62 

Upvotes: 0

Views: 207

Answers (1)

GWD

Reputation: 1464

Here are some quick and dirty first steps - i.e. an abstraction of how you can start vectorizing your problem, using just a bunch of random numbers.

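set.seed(12) # set a seed for reproducibility
Output <- matrix(sample(x=c(10:40),40, TRUE), ncol=5)
Intv <- matrix(1:16, ncol=1)
# Intv is the left-hand argument, so the question's Output >= Intv[k] becomes Intv[k] <= Output
l <- lapply(X=Intv, FUN=`<=`, Output)
# one row of counts per Intv value; use rowSums instead for per-row counts as in the question
lc <- t(sapply(l, colSums))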

Sorry, I noticed your sample data too late.
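For what it's worth, here is a rough sketch of the same idea on data shaped like the sample in the question. It assumes port holds one weight per row of output (so a two-element port is defined here), and it divides by ncol(output) instead of the hard-coded 200. The helper name weighted_share is made up for illustration. Handling one threshold at a time means only one logical matrix exists in memory at any moment, which may matter at 16,384 x 200.

```r
# Sample data shaped like the question's; port is ASSUMED to have one weight per row of output
intv   <- matrix(c(1, 5, 15), ncol = 1)
output <- matrix(c(5, 10, 15, 20, 25, 30), nrow = 2)
port   <- matrix(c(1, 2), ncol = 1)

# Hypothetical helper: for one threshold, compute the share of columns in each
# row that are >= the threshold, then weight those shares by port.
# rowSums() replaces the inner loop over rows.
weighted_share <- function(threshold) {
  p <- rowSums(output >= threshold) / ncol(output)
  drop(crossprod(port, p))   # 1x1 matrix -> plain number
}

# vapply() replaces the outer loop and returns a plain numeric vector
cp <- vapply(intv[, 1], weighted_share, numeric(1))
cp
```

The result is a vector with one value per row of intv; wrap it in matrix(cp, ncol = 1) if you need the original cp[k,1] layout.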

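After you are done with the above example, your next step would be to replace the *apply functions with their parallel counterparts, e.g. parLapply or parSapply from the snow package (the same functions are also available in the parallel package that ships with R), to speed up the procedure through parallelisation.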
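A minimal sketch of that parallel step, using the parallel package and random stand-in data. The shapes, the worker count of 2, and the helper name weighted_share are all assumptions for illustration, not something taken from your real data. Each threshold is independent of the others, which is what makes the outer loop embarrassingly parallel.

```r
library(parallel)

# Random stand-ins for the real data (ASSUMED shapes: one port weight per output row)
set.seed(12)
output <- matrix(sample(10:40, 40, TRUE), ncol = 5)
intv   <- matrix(seq(10, 40, by = 2), ncol = 1)
port   <- matrix(runif(nrow(output)), ncol = 1)

# Hypothetical helper: data is passed in as arguments so that each worker
# receives its own copy and no global variables are needed on the workers.
weighted_share <- function(threshold, output, port) {
  p <- rowSums(output >= threshold) / ncol(output)
  drop(crossprod(port, p))
}

cl <- makeCluster(2)   # two local workers; adjust to the machine
cp_par <- parSapply(cl, intv[, 1], weighted_share, output = output, port = port)
stopCluster(cl)

# Serial version for comparison
cp_ser <- vapply(intv[, 1], weighted_share, numeric(1), output = output, port = port)
all.equal(cp_par, cp_ser)
```

Whether this pays off depends on the cost of shipping output to every worker, so it is worth timing against the serial version on a subset first.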

Upvotes: 2
