Reputation: 1057
I am trying to match records across what could be a fairly large data set, and even on a medium-sized data set it takes too long.
The task is to take each mechanical problem, then go back 6 months and look for procedural problems (failures on the part of individual employees). I match first on machine and location, so that I compare the same machine at the same place. Then I require that the procedural error come before the mechanical one, since I am looking backward from the mechanical problem. Finally, I limit the gap to 180 days to keep things comparable.
In the data construction phase, I exclude mechanical issues from the first 6 months, so that every mechanical issue has the same 180-day block behind it.
I have read a fair bit on optimizing loops. I know that you want to preallocate a storage variable outside of the loop and then fill it in, but I have no idea how many matches there will be, so initially I had been using rbind inside of the loop. I know the upper bound on the storage variables is the number of mechanical issues * the number of procedural issues, but this is gigantic and I can't allocate a vector that large. The code I have placed here uses the maximum-size storage approach, but I think I will have to go back to something like this:
if (counter == 1) {
  pro = procedural[i, ]
  other = mechanical[j, ]
} else {
  pro = rbind(pro, procedural[i, ])
  other = rbind(other, mechanical[j, ])
}
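When the number of matches is unknown, one common alternative to calling rbind on every hit is to collect the matching rows in a list and bind them once after the loop. The following is only a sketch on toy data: the column names `machine`, `location` and `day` are assumptions standing in for columns 16, 2 and 17 of the real tables, and `pro_list` / `other_list` are hypothetical names.

```r
# Toy stand-ins for the real tables (column names are assumptions)
procedural <- data.frame(machine  = c("A", "A", "B"),
                         location = c("x", "x", "y"),
                         day      = c(10, 200, 300))
mechanical <- data.frame(machine  = c("A", "B"),
                         location = c("x", "y"),
                         day      = c(100, 400))

pro_list <- list()
other_list <- list()
for (j in seq_len(nrow(mechanical))) {
  for (i in seq_len(nrow(procedural))) {
    # Positive gap means the procedural issue came first
    gap <- mechanical$day[j] - procedural$day[i]
    if (procedural$machine[i] == mechanical$machine[j] &&
        procedural$location[i] == mechanical$location[j] &&
        gap > 0 && gap < 180) {
      k <- length(pro_list) + 1
      pro_list[[k]] <- procedural[i, ]
      other_list[[k]] <- mechanical[j, ]
    }
  }
}
# Bind once at the end (the result is NULL if nothing matched)
pro <- do.call(rbind, pro_list)
other <- do.call(rbind, other_list)
```

This still runs the double loop, so it only addresses the storage problem, not the number of comparisons.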
I have also read a fair bit about vectorization, but I have never actually managed to get it to work. I have tried a few different things on the vectorization front, but I think I must be doing something wrong.
I also tried removing the second loop and just using the which command, but that doesn't seem to work with a full column of data (from the procedural data) being compared to a single value (from the mechanical data).
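For reference, comparing a whole column against a single value is something R normally handles by recycling the single value, provided the conditions are combined with element-wise `&` rather than `&&`. A minimal sketch of the one-loop version on toy data follows; the column names `machine`, `location` and `day` are assumptions standing in for columns 16, 2 and 17, and `hits` is a hypothetical name.

```r
# Toy stand-ins for the real tables (column names are assumptions)
procedural <- data.frame(machine  = c("A", "A", "B"),
                         location = c("x", "x", "y"),
                         day      = c(10, 200, 300))
mechanical <- data.frame(machine  = c("A", "B"),
                         location = c("x", "y"),
                         day      = c(100, 400))

# One entry per mechanical issue, holding the matching procedural row numbers
hits <- vector("list", nrow(mechanical))
for (j in seq_len(nrow(mechanical))) {
  # Whole-column arithmetic against the single mechanical date
  gap <- mechanical$day[j] - procedural$day
  hits[[j]] <- which(procedural$machine == mechanical$machine[j] &
                     procedural$location == mechanical$location[j] &
                     gap > 0 & gap < 180)
}

numprocissues <- lengths(hits)                 # matches per mechanical issue
pro <- procedural[unlist(hits), ]              # matched procedural rows
other <- mechanical[rep(seq_along(hits), numprocissues), ]  # aligned mechanical rows
```

Here `which` also drops any comparisons that evaluate to NA, which an `if` inside a loop would not tolerate.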
Here is the code I have currently. It works fine on small data sets, but on anything remotely large it takes forever.
maxval = mechrow * prorow
# Preallocate for the worst case: every procedural row matches every mechanical row
pro = matrix(nrow = maxval, ncol = ncol(procedural))
other = matrix(nrow = maxval, ncol = ncol(mechanical))
numprocissues = matrix(nrow = mechrow, ncol = 1)
counter = 1
for (j in 1:mechrow) {
  for (i in 1:prorow) {
    # Scalar tests, so use short-circuit && rather than element-wise &
    if (procedural[i, 16] == mechanical[j, 16] &&
        procedural[i, 17] < mechanical[j, 17] &&
        procedural[i, 2] == mechanical[j, 2] &&
        abs(procedural[i, 17] - mechanical[j, 17]) < 180) {
      pro[counter, ] = procedural[i, ]
      other[counter, ] = mechanical[j, ]
      counter = counter + 1
    }
  }
  # counter is the next free row, i.e. total matches so far plus one
  numprocissues[j, 1] = counter
}
The places where I imagine improvement can be made are my storage variable, potential vectorization, changing the conditions in the if statement, or maybe a fancy which statement to remove a loop.
Any advice would be greatly appreciated!
Thank you.
Upvotes: 3
Views: 732
Reputation: 37754
Untested...
xy <- expand.grid(mech=1:mechrow, pro=1:prorow)
ok <- (procedural[xy$pro, 16] == mechanical[xy$mech, 16] &
       procedural[xy$pro, 17] < mechanical[xy$mech, 17] &
       procedural[xy$pro, 2] == mechanical[xy$mech, 2] &
       abs(procedural[xy$pro, 17] - mechanical[xy$mech, 17]) < 180)
pro <- procedural[xy$pro[ok],]
other <- mechanical[xy$mech[ok],]
numprocissues <- tapply(ok, xy$mech, sum)
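Note that expand.grid builds all mechrow * prorow index pairs, which is the same size the question says is too large to allocate. If memory is the limit, one possible variation is to let merge pair rows only within the same machine and location, then filter on the date gap. This is a sketch on toy data, not a tested drop-in: the column names `machine`, `location` and `day` are assumptions standing in for columns 16, 2 and 17, and `pro_id` / `mech_id` are hypothetical helper columns.

```r
# Toy stand-ins for the real tables (column names are assumptions)
procedural <- data.frame(machine  = c("A", "A", "B"),
                         location = c("x", "x", "y"),
                         day      = c(10, 200, 300))
mechanical <- data.frame(machine  = c("A", "B"),
                         location = c("x", "y"),
                         day      = c(100, 400))

# Tag each row so matches can be traced back to the original tables
procedural$pro_id <- seq_len(nrow(procedural))
mechanical$mech_id <- seq_len(nrow(mechanical))

# Pair rows only when machine and location agree
pairs <- merge(mechanical, procedural,
               by = c("machine", "location"),
               suffixes = c(".mech", ".pro"))

# Keep pairs where the procedural issue precedes the mechanical one by under 180 days
gap <- pairs$day.mech - pairs$day.pro
pairs <- pairs[gap > 0 & gap < 180, ]

pro <- procedural[pairs$pro_id, ]
other <- mechanical[pairs$mech_id, ]
# Matches per mechanical issue, including zeros for issues with no match
numprocissues <- tabulate(pairs$mech_id, nbins = nrow(mechanical))
```

The intermediate table is then only as large as the number of same-machine, same-location pairs, which may or may not be small enough depending on how the data are distributed. Missing dates would need handling before the filter step.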
Upvotes: 6