Reputation: 33
I wrote some very simple R code, but it takes almost 2 hours to run on data with more than 2,000,000 rows.
Is there any way to speed it up? I would prefer the simplest possible solution.
My R skills are okay (less than 1 year of experience), but I have reached my limit here. I have also read some articles about speeding up if/else loops, but I am not sure which strategy suits my code best (e.g. vectorisation, ifelse, parallelism).
Thanks for your help.
system.time(
  for (i in 1:(length(mydata$session_id) - 1)) {
    if (mydata$session_id[i] != mydata$session_id[i + 1]) {
      mydata$Einstiegskanal[i] <- "1"
    } else {
      mydata$Einstiegskanal[i] <- "0"
    }
  }
)
# 6877.1 seconds = 1.91 h
Upvotes: 1
Views: 322
Reputation: 33
Thank you very much for your answers!
The following code, adapted from Benjamin's answer, works perfectly for me :) Combining the diff function with ifelse is very smart, and it works for many of my if/else loops.
system.time({
  mydata$Einstiegskanal <- ifelse(c(diff(mydata$session_id) == 0, NA), "0", "1")
})
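As a quick sanity check, the vectorised line can be compared with the original loop on a small example. This is only a sketch: the toy data frame is made up, `loop_result` is a hypothetical helper vector, and the last row is deliberately left as NA because it has no following row to compare against.

```r
# Made-up toy data; only the column names come from the question.
mydata <- data.frame(session_id = c(1, 1, 2, 2, 2, 3))

# Loop version: flag row i with "1" when the next row has a different id.
loop_result <- rep(NA_character_, nrow(mydata))
for (i in seq_len(nrow(mydata) - 1)) {
  loop_result[i] <- if (mydata$session_id[i] != mydata$session_id[i + 1]) "1" else "0"
}

# Vectorised version: diff() is 0 wherever two consecutive ids are equal.
mydata$Einstiegskanal <- ifelse(c(diff(mydata$session_id) == 0, NA), "0", "1")

# TRUE when both versions agree element by element.
identical(loop_result, mydata$Einstiegskanal)
```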
Upvotes: 0
Reputation: 17369
It appears what you're doing is just taking the difference between the ids from one row to the next. diff was made for this.
session_id <- sample(1:10, size = 2000000, replace = TRUE)
system.time({
  # "0" where the next id is the same, "1" where it differs (as in the question's loop)
  ifelse(c(diff(session_id) == 0, NA), "0", "1")
})
   user  system elapsed
   0.64    0.05    0.69
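To see why the NA is appended, it may help to look at what diff returns on a short vector. The ids below are made up purely for illustration, and `same_as_next` is a hypothetical name.

```r
# Made-up ids, purely for illustration.
ids <- c(5, 5, 7, 7, 7, 9)

# diff() returns ids[i + 1] - ids[i], so it is one element shorter than ids.
d <- diff(ids)
d  # 0 2 0 0 2

# Appending NA pads the comparison back to the original length;
# the last element has no successor to compare against.
same_as_next <- c(d == 0, NA)
same_as_next  # TRUE FALSE TRUE TRUE FALSE NA
```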
If you really want to speed it up, you can try avoiding the ifelse as well.
Your code would be
# TRUE where the next row has the same session_id; NA for the last row
lgl <- c(diff(mydata$session_id) == 0, NA)
mydata$Einstiegskanal[!lgl] <- "1"
mydata$Einstiegskanal[lgl] <- "0"
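Here is a small self-contained sketch of the same idea on made-up data. Pre-allocating the column with NA is my own assumption rather than part of the code above; it makes it explicit that the last row, where `lgl` is NA, is never overwritten. R skips NA positions in a logical subscript during assignment as long as the assigned value has length one.

```r
# Made-up toy data; only the column names come from the question.
mydata <- data.frame(session_id = c(1, 1, 2, 2, 2, 3))

# Assumption: start from an all-NA character column.
mydata$Einstiegskanal <- NA_character_

# TRUE where the next row has the same id, NA for the last row.
lgl <- c(diff(mydata$session_id) == 0, NA)

# NA entries in the index select nothing, so the last row keeps its NA.
mydata$Einstiegskanal[!lgl] <- "1"
mydata$Einstiegskanal[lgl] <- "0"

mydata
```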
For a comparison of speed between the two approaches:
library(microbenchmark)
session_id <- sample(1:10, size = 2000000, replace = TRUE)
# Pre-allocated result vector for the indexing approach
y <- vector("character", length(session_id))
microbenchmark(
  with_ifelse = ifelse(c(diff(session_id) == 0, NA), "0", "1"),
  avoid_ifelse = {
    lgl <- c(diff(session_id) == 0, NA)
    y[!lgl] <- "1"
    y[lgl] <- "0"
  },
  times = 10)
Unit: milliseconds
         expr       min        lq     mean    median        uq      max neval cld
  with_ifelse 684.69879 686.16912 710.3928 714.88029 726.61384 736.1481    10   b
 avoid_ifelse  88.75335  89.21844  98.8694  90.46677  92.03064 139.8182    10  a
Upvotes: 3
Reputation: 507
You can try something like this:
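# Example data: 2 million random ids between 0 and 10
mydata <- data.frame(session_id = round(runif(2e6, 0, 10), 0))
# The same ids shifted up by one row (first row dropped)
mydata2 <- data.frame(session_id = mydata[-1,])
# 1 where a row has the same id as the next row, 0 otherwise
# (note: this is the reverse of the flag in the question)
mydata$Einstiegskanal <- c(ifelse(mydata$session_id[1:(nrow(mydata)-1)]==mydata2,1,0), NA)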
I set the last value of mydata$Einstiegskanal to NA, as the comparison vector has one less element than mydata has rows.
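The same shifted comparison can probably be written without the helper data frame, using head and tail to line each id up with its successor. This is only a sketch on made-up data; `current` and `following` are hypothetical names, and the flag keeps the direction used above (1 where the next id is equal).

```r
# Made-up toy data; only the column names come from the question.
mydata <- data.frame(session_id = c(1, 1, 2, 2, 2, 3))

# All ids except the last, and all ids except the first.
current <- head(mydata$session_id, -1)
following <- tail(mydata$session_id, -1)

# 1 where a row has the same id as the next row, 0 otherwise;
# the last row has no successor, so it gets NA.
mydata$Einstiegskanal <- c(as.integer(current == following), NA)

mydata
```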
Upvotes: 0