asjohnson

Reputation: 1057

Speeding up Nested Loop; Can it be vectorized?

I am trying to match records in what could become a fairly large data set, and even on a medium-sized data set it is taking too long.

The task is to take a mechanical problem, then go back 6 months and look for procedural problems (failures on the part of individual employees). I match first on machine and location, so that I compare the same machine at the same place. Then I require that the procedural error comes before the mechanical one, since anything later is in the future relative to the mechanical problem. Finally, I limit the window to 180 days to keep things comparable.

In the data construction phase, I limit the mechanical issues to exclude the first 6 months, so I have the same 180-day block for each.

I have read a fair bit on optimizing loops. I know that you want to create a storage variable outside of the loop and then just fill it in, but I don't actually know how many matches will be returned, so initially I had been using rbind inside the loop. I know the upper bound on the storage variables is the number of mechanical issues * the number of procedural issues, but this is gigantic and I can't allocate a vector that large. The code I have posted here uses my max-sized storage variable approach, but I think I will have to go back to something like this:

if (counter == 1) {
    pro = procedural[i, ]
    other = mechanical[j, ]
} else {
    pro = rbind(pro, procedural[i, ])
    other = rbind(other, mechanical[j, ])
}
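
One way to avoid both the huge preallocated matrix and the repeated rbind calls is to collect matched rows in a list and bind them once at the end. The sketch below is only an illustration on made-up data: the `id` and `day` columns and the `day > 15` test are placeholders, not the real matching rule.

```r
# Hypothetical toy data; column names are illustrative, not from the real data set.
procedural <- data.frame(id = 1:3, day = c(10, 20, 30))

matches <- list()                      # one element per matched row
for (i in seq_len(nrow(procedural))) {
    if (procedural$day[i] > 15) {      # stand-in for the real match condition
        matches[[length(matches) + 1]] <- procedural[i, ]
    }
}

# A single rbind at the end instead of one rbind per match.
pro <- do.call(rbind, matches)
```

Appending to a list only adds a reference to the new row, which is usually much cheaper than copying a growing data frame on every match.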

I have also read a fair bit about vectorization, but I have never actually managed to get it to work. I have tried a few different things on the vectorization front, but I think I must be doing something wrong.

I also tried removing the second loop and just using the which command, but that doesn't seem to work with a full column of data (from the procedural data) being compared to a single value (from the mechanical data).
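
For reference, comparing a whole column against a single value does work in R, because the single value is recycled across the column. Below is a minimal sketch of dropping the inner loop this way. The data and the column names `machine`, `loc` and `day` are made up for illustration (the real data uses columns 16, 2 and 17), and the keys are assumed to be plain character or numeric vectors rather than factors with differing levels.

```r
# Hypothetical toy data standing in for the real tables.
procedural <- data.frame(machine = c("A", "A", "B", "A"),
                         loc     = c(1, 1, 1, 2),
                         day     = c(100, 300, 120, 150),
                         stringsAsFactors = FALSE)
mechanical <- data.frame(machine = c("A", "B"),
                         loc     = c(1, 1),
                         day     = c(200, 400),
                         stringsAsFactors = FALSE)

pro_idx <- vector("list", nrow(mechanical))
for (j in seq_len(nrow(mechanical))) {
    # Days from each procedural issue to this mechanical issue (positive = earlier).
    gap <- mechanical$day[j] - procedural$day
    pro_idx[[j]] <- which(procedural$machine == mechanical$machine[j] &
                          procedural$loc     == mechanical$loc[j] &
                          gap > 0 & gap < 180)
}

numprocissues <- lengths(pro_idx)      # matches per mechanical issue
pro   <- procedural[unlist(pro_idx), ]
other <- mechanical[rep(seq_len(nrow(mechanical)), numprocissues), ]
```

Here `gap > 0 & gap < 180` expresses the same two date conditions as the original if statement (procedural date earlier, and less than 180 days apart).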

Here is the code I have currently. It works fine on small data sets, but on anything remotely large it takes forever.

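# Upper bound on matches: every mechanical row paired with every procedural row
maxval = mechrow * prorow
pro = matrix(nrow = maxval, ncol = ncol(procedural))
other = matrix(nrow = maxval, ncol = ncol(mechanical))
numprocissues = matrix(nrow = mechrow, ncol = 1)
counter = 1
for (j in 1:mechrow) {
    for (i in 1:prorow) {
        # Columns 16 and 2 are the matching keys (machine and location);
        # column 17 is the date. && is used because each test is a single value.
        if (procedural[i, 16] == mechanical[j, 16] &&
            procedural[i, 17] < mechanical[j, 17] &&
            procedural[i, 2] == mechanical[j, 2] &&
            abs(procedural[i, 17] - mechanical[j, 17]) < 180) {

            pro[counter, ] = procedural[i, ]
            other[counter, ] = mechanical[j, ]
            counter = counter + 1
        }
    }
    # Note: counter is one more than the running total of matches so far
    numprocissues[j, 1] = counter
}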

The places where I imagine improvements can be made are the storage variable, potential vectorization, the conditions in the if statement, or maybe a clever which call to remove a loop.

Any advice would be greatly appreciated!

Thank you.

Upvotes: 3

Views: 732

Answers (1)

Aaron - mostly inactive

Reputation: 37754

Untested...

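# All (mechanical row, procedural row) index pairs: mechrow * prorow rows in total
xy <- expand.grid(mech=1:mechrow, pro=1:prorow)
# Evaluate the four match conditions for every pair at once
ok <- (procedural[xy$pro, 16] == mechanical[xy$mech, 16] &
       procedural[xy$pro, 17] <  mechanical[xy$mech, 17] &
       procedural[xy$pro,  2] == mechanical[xy$mech,  2] &
       abs(procedural[xy$pro, 17] -  mechanical[xy$mech, 17]) < 180)
# Keep only the matching pairs
pro   <- procedural[xy$pro[ok],]
other <- mechanical[xy$mech[ok],]
# Number of matching procedural issues per mechanical issue
numprocissues <- tapply(ok, xy$mech, sum)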
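
Since expand.grid builds all mechrow * prorow pairs, it may hit the same memory limit mentioned in the question. A possible alternative, sketched here on made-up data, is to let merge() pair rows only where the two keys already agree and then filter on the dates. The column names `machine`, `loc`, `day` and the helper column `mech_id` are assumptions for illustration; the real data would use columns 16, 2 and 17.

```r
# Hypothetical toy data standing in for the real tables.
procedural <- data.frame(machine = c("A", "A", "B", "A"),
                         loc     = c(1, 1, 1, 2),
                         day     = c(100, 300, 120, 150),
                         stringsAsFactors = FALSE)
mechanical <- data.frame(machine = c("A", "B"),
                         loc     = c(1, 1),
                         day     = c(200, 400),
                         stringsAsFactors = FALSE)
mechanical$mech_id <- seq_len(nrow(mechanical))   # illustrative row identifier

# Pair rows only where machine and location agree; the shared "day"
# column is split into day.mech and day.pro by the suffixes.
both <- merge(mechanical, procedural, by = c("machine", "loc"),
              suffixes = c(".mech", ".pro"))

# Keep procedural issues that come before the mechanical one, within 180 days.
gap <- both$day.mech - both$day.pro
matched <- both[gap > 0 & gap < 180, ]

# Matches per mechanical issue, including those with none.
numprocissues <- table(factor(matched$mech_id, levels = mechanical$mech_id))
```

The intermediate table still grows with the number of same-machine, same-location pairs, so this only helps when the keys split the data into reasonably small groups.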

Upvotes: 6
