RBasti

Reputation: 33

Improve slow if else loop in R

I wrote a very simple piece of R code, but it takes almost 2 hours on data with more than 2,000,000 rows.

Is there any way to speed up the code? I would prefer a solution that is as simple as possible.

My R skills are okay (less than 1 year of experience), but I have reached my limit here. I have also read some articles about speeding up if/else loops, but I am not sure which strategy suits my code best (e.g. vectorisation, ifelse, parallelism).

Thanks for your help.

    system.time(
      for (i in 1:(length(mydata$session_id)-1)){
        if (mydata$session_id[i] != mydata$session_id[i+1]){
          mydata$Einstiegskanal[i]="1"
        } else {
          mydata$Einstiegskanal[i]="0"
        }
      }
    )

    # 6877.1 seconds = 1.91 h

Upvotes: 1

Views: 322

Answers (3)

RBasti

Reputation: 33

Thank you very much for your answers!

The following code, adapted from Benjamin's answer, works perfectly for me :) Combining the diff function with ifelse is very smart, and it works for many of my other if/else loops too.

system.time({
  mydata$Einstiegskanal<-ifelse(c(diff(mydata$session_id) == 0, NA), "0", "1")
})
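
For anyone who wants to double-check that the vectorised line really reproduces the loop from the question, here is a minimal sketch on a small random sample. The seed, the sample size and the names loop_flag and vec_flag are made up for illustration, and the loop writes into a plain character vector that is assumed to start out as NA (the original loop never touches the last row).

```r
set.seed(1)  # arbitrary seed, only for reproducibility
session_id <- sample(1:10, size = 1000, replace = TRUE)
n <- length(session_id)

# Loop from the question, writing into a plain character vector.
# Assumption: the vector starts as NA, so the untouched last element stays NA.
loop_flag <- rep(NA_character_, n)
for (i in seq_len(n - 1)) {
  loop_flag[i] <- if (session_id[i] != session_id[i + 1]) "1" else "0"
}

# Vectorised version: "0" where the next id is the same, "1" where it changes
vec_flag <- ifelse(c(diff(session_id) == 0, NA), "0", "1")

# Stops with an error if the two results differ anywhere
stopifnot(identical(loop_flag, vec_flag))
```

If stopifnot returns silently, both versions agree on every row, including the NA in the last position.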

Upvotes: 0

Benjamin

Reputation: 17369

It appears what you're doing is just a difference between the ids from one row to the next. diff was made for this.

session_id <- sample(1:10, size = 2000000, replace = TRUE)

system.time({
  # "0" where the next id is the same, "1" where it changes (as in the question's loop)
  ifelse(c(diff(session_id) == 0, NA), "0", "1")
})
   user  system elapsed 
   0.64    0.05    0.69
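
To make the connection between diff and the original loop easier to see, here is a small sketch on a toy vector. The values and the names d, changes and flag are hypothetical and only serve as illustration.

```r
# Hypothetical toy vector, not taken from the question
session_id <- c(1, 1, 2, 2, 2, 3)

# diff() returns session_id[i + 1] - session_id[i], one element shorter than the input
d <- diff(session_id)

# A non-zero difference means the id changes at the next row;
# the last row has no successor, so it is padded with NA
changes <- c(d != 0, NA)

# Same labelling as the loop in the question: "1" where the id changes, otherwise "0"
flag <- ifelse(changes, "1", "0")
print(flag)
```

Each element of flag corresponds to one iteration of the loop, which is why the whole loop collapses into a single vectorised expression.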

If you really want to speed it up, you can try avoiding the ifelse as well.

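Applied to your data, that would be

# TRUE where the next row has the same session_id; NA for the last row
lgl <- c(diff(mydata$session_id) == 0, NA)

mydata$Einstiegskanal[!lgl] <- "1"
mydata$Einstiegskanal[lgl] <- "0"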

For a comparison of speed between the two approaches:

library(microbenchmark)
session_id <- sample(1:10, size = 2000000, replace = TRUE)

y <- vector("character", length(session_id))

microbenchmark(
  with_ifelse = ifelse(c(diff(session_id) == 0, NA), "0", "1"),
  avoid_ifelse = {
    lgl <- c(diff(session_id) == 0, NA)
    y[!lgl] <- "1"
    y[lgl] <- "0"
  },
  times = 10)

Unit: milliseconds
         expr       min        lq     mean    median        uq      max neval cld
  with_ifelse 684.69879 686.16912 710.3928 714.88029 726.61384 736.1481    10   b
 avoid_ifelse  88.75335  89.21844  98.8694  90.46677  92.03064 139.8182    10  a 
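
As a rough sketch of how the logical-index approach behaves on a data frame, including the NA in the last position, consider the toy example below. The data frame contents are hypothetical, and pre-allocating the column with NA_character_ is my own assumption about how you might set it up.

```r
# Hypothetical toy data frame, not taken from the question
mydata <- data.frame(session_id = c(5, 5, 7, 7, 9))

# TRUE where the next row has the same id; the last row has no successor, so NA
lgl <- c(diff(mydata$session_id) == 0, NA)

# Pre-allocate the column (assumption), then fill it in two steps.
# NA positions in a logical index are skipped when the replacement has length one.
mydata$Einstiegskanal <- NA_character_
mydata$Einstiegskanal[!lgl] <- "1"
mydata$Einstiegskanal[lgl] <- "0"
print(mydata)
```

The last row keeps its NA because neither lgl nor !lgl is TRUE there, which matches what the ifelse version returns for that row.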

Upvotes: 3

Hugo

Reputation: 507

You can try something like this:

mydata <- data.frame(session_id = round(runif(2e6, 0, 10), 0))
mydata2 <- data.frame(session_id = mydata[-1,])
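# "0" where the next row has the same id, "1" where it changes (as in the question's loop)
mydata$Einstiegskanal <- c(ifelse(mydata$session_id[1:(nrow(mydata)-1)] == mydata2$session_id, "0", "1"), NA)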

I set the last value of mydata$Einstiegskanal to NA because the comparison yields one element fewer than mydata has rows.

Upvotes: 0
