Reputation: 5
I have a database with thousands of records. The "for i loop" iteration takes a long time to execute, it is not possible to use it because of the size of the database.
I need to perform an iteration where:
Previously, the database is sorted by variable x
for (i in 2:n) {
if (db$x[i] == db$x[i - 1]) {
db$y[i] <- db$y[i - 1] + db$z[i]
} else {
db$y[i] <- db$z[i]
}
}
Is there a way to make it more efficient?
Any help will be appreciated
Upvotes: 0
Views: 93
Reputation: 72813
You can avoid loops at all, use subsets which is much faster.
Here a small data base.
db
# x y z
# 1 1 3 1
# 2 5 6 3
# 3 1 7 2
# 4 1 1 5
# 5 2 3 3
# 6 4 3 5
# 7 2 5 4
# 8 2 1 4
# 9 1 8 2
# 10 4 6 4
First, create a boolean vector where x and its lag are equal.
u <- db$x == c(FALSE, db$x[-nrow(db)])
Next, care for the positive case
db$y[u] <- (c(NA, db$y[-nrow(db)]) + db$z)[u]
and for the negative case.
db$y[!u] <- db$z[!u]
And we get:
db
# x y z
# 1 1 1 1
# 2 5 3 3
# 3 1 2 2
# 4 1 12 5
# 5 2 3 3
# 6 4 5 5
# 7 2 4 4
# 8 2 9 4
# 9 1 2 2
# 10 4 4 4
Data:
set.seed(42)
db <- data.frame(x=sample.int(5, 10, replace=TRUE),
y=rpois(10, 3), z=rpois(10, 3))
Upvotes: 0
Reputation: 24480
First, some repro data:
set.seed(1)
n<-100000
db<-data.frame(x = sample(10, n, TRUE), z = runif(n))
Then, rewrite your code correcting some index to make it work:
db$y<-0
db$y[1]<-db$z[1]
for( i in 2:n) {
if (db$x[i]==db$x[i-1]) {
db$y[i] <- db$y[i-1] + db$z[i]
} else {db$y[i] <- db$z[i]}
if (i%%1000==0) cat(i,"\n")
}
You can try:
require(data.table)
xr<-rleid(db$x)
y<-ave(db$z, xr, FUN = cumsum)
#check whether the results match
all.equal(y, db$y)
Upvotes: 1