Faster alternative methods to for-loop

Question

I have a database with thousands of records. The "for i loop" iteration takes a long time to execute, it is not possible to use it because of the size of the database.

I need to perform an iteration where:

if row i of x is equal to row i-1 of x then,
y(i) = y(i-1) + z(i), y is a cumulative variable
Else y(i) = z(i)

Previously, the database is sorted by variable x

for (i in 2:n) {
  if (db$x[i] == db$x[i - 1]) {
    db$y[i] <- db$y[i - 1] + db$z[i]  
  } else {
    db$y[i] <- db$z[i]
  }
}

Is there a way to make it more efficient?

Any help will be appreciated

nicola · Accepted Answer

First, some repro data:

set.seed(1)
n<-100000
db<-data.frame(x = sample(10, n, TRUE), z = runif(n))

Then, rewrite your code correcting some index to make it work:

db$y<-0
db$y[1]<-db$z[1]
for( i in 2:n) {
  if (db$x[i]==db$x[i-1]) {
      db$y[i] <- db$y[i-1] + db$z[i]  
      } else {db$y[i] <- db$z[i]}
  if (i%%1000==0) cat(i,"
")
  
}

You can try:

require(data.table)
xr<-rleid(db$x)
y<-ave(db$z, xr, FUN = cumsum)
#check whether the results match
all.equal(y, db$y)

Faster alternative methods to for-loop

Answers (2)

Related Questions