Carlos
Carlos

Reputation: 5

Faster alternative methods to for-loop

I have a database with thousands of records. The "for i loop" iteration takes a long time to execute, it is not possible to use it because of the size of the database.

I need to perform an iteration where:

  1. if row i of x is equal to row i-1 of x then,
  2. y(i) = y(i-1) + z(i), y is a cumulative variable
  3. Else y(i) = z(i)

Previously, the database is sorted by variable x

for (i in 2:n) {
  if (db$x[i] == db$x[i - 1]) {
    db$y[i] <- db$y[i - 1] + db$z[i]  
  } else {
    db$y[i] <- db$z[i]
  }
}

Is there a way to make it more efficient?

Any help will be appreciated

Upvotes: 0

Views: 93

Answers (2)

jay.sf
jay.sf

Reputation: 72813

You can avoid loops at all, use subsets which is much faster.

Here a small data base.

db
#    x y z
# 1  1 3 1
# 2  5 6 3
# 3  1 7 2
# 4  1 1 5
# 5  2 3 3
# 6  4 3 5
# 7  2 5 4
# 8  2 1 4
# 9  1 8 2
# 10 4 6 4

First, create a boolean vector where x and its lag are equal.

u <- db$x == c(FALSE, db$x[-nrow(db)])

Next, care for the positive case

db$y[u] <- (c(NA, db$y[-nrow(db)]) + db$z)[u]

and for the negative case.

db$y[!u] <- db$z[!u]

And we get:

db
#    x  y z
# 1  1  1 1
# 2  5  3 3
# 3  1  2 2
# 4  1 12 5
# 5  2  3 3
# 6  4  5 5
# 7  2  4 4
# 8  2  9 4
# 9  1  2 2
# 10 4  4 4

Data:

set.seed(42)
db <- data.frame(x=sample.int(5, 10, replace=TRUE),
                 y=rpois(10, 3), z=rpois(10, 3))

Upvotes: 0

nicola
nicola

Reputation: 24480

First, some repro data:

set.seed(1)
n<-100000
db<-data.frame(x = sample(10, n, TRUE), z = runif(n))

Then, rewrite your code correcting some index to make it work:

db$y<-0
db$y[1]<-db$z[1]
for( i in 2:n) {
  if (db$x[i]==db$x[i-1]) {
      db$y[i] <- db$y[i-1] + db$z[i]  
      } else {db$y[i] <- db$z[i]}
  if (i%%1000==0) cat(i,"\n")
  
}

You can try:

require(data.table)
xr<-rleid(db$x)
y<-ave(db$z, xr, FUN = cumsum)
#check whether the results match
all.equal(y, db$y)

Upvotes: 1

Related Questions