htii

Reputation: 21

How can I vectorize this double for loop?

Recently, I have been trying to write more code. In my training with R, I never really learned how to vectorize code to speed it up. Now that I'm at work, I am the only one using R, and I would like to have an impact; part of that impact includes writing fast code. For a specific project I am running a double for loop that takes approximately 22 minutes on the hardware they provided us. Is there a way I can speed this up with vectorization? I have created some toy data to simulate what I have been doing at work with a double for loop:

df1 <- data.frame(x = 1:600000,
                  y = sample(1:24, size = 600000, replace = TRUE),
                  z = sample(1:3, 600000, replace = TRUE),
                  w = sample(c("yes", "no"), size = 1, replace = TRUE))

df2 <- data.frame(a = 1:24, 
                  b = sample(1:24, size = 1, replace = FALSE),
                  c = sample(c(20,30,40), size = 1, replace = TRUE), 
                  d = sample(c("yes", "no"), size = 1, replace = TRUE),
                  e = sample(c(TRUE, FALSE), replace = TRUE, size = 1))



# preallocate the output columns
df1$pay <- numeric(600000)
df1$ynm <- character(600000)


start <- Sys.time()
for(i in 1:nrow(df1)) {
  for(j in 1:nrow(df2)){
    # copy pay and ynm across whenever the (y, w) pair matches (b, d)
    if(df1$y[i] == df2$b[j] & df1$w[i] == df2$d[j]) {
      df1$pay[i] <- df2$c[j]
      df1$ynm[i] <- df2$e[j]
    }
  }
}
Sys.time() - start # timing included for benchmarking
#> Time difference of 5.407024 mins

I realize not every row will have a complete match; this is expected given the nature of the data. My main goal is to speed this task up. The datasets are different sizes on purpose, because I need to classify rows in one dataset based on items from the other. Additionally, the benchmark above was run on my personal PC, which is much more powerful than my work laptop.

Upvotes: 0

Views: 38

Answers (2)

Allan Cameron

Reputation: 174546

You could get an identical result far faster like this:

library(dplyr)
library(tidyr)  # for replace_na()

start <- Sys.time()

df3 <- left_join(df1, df2, by = c(y = "b", w = "d")) %>% 
  select(-a) %>% 
  rename(pay = c, ynm = e) %>%
  replace_na(list(pay = 0, ynm = ""))

Sys.time() - start
#> Time difference of 0.6406 secs

Checking the result:

head(df3)
#>   x  y z  w pay ynm
#> 1 1 12 1 no   0    
#> 2 2 17 1 no   0    
#> 3 3 21 3 no   0    
#> 4 4  9 1 no   0    
#> 5 5  7 1 no   0    
#> 6 6  1 2 no   0    
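
As an aside, since the question notes that not every row will have a match, here is a minimal sketch using dplyr's anti_join() to inspect the rows that fall through the join (assuming the same toy data and keys as above):

# rows of df1 that have no (y, w) match in df2
unmatched <- anti_join(df1, df2, by = c(y = "b", w = "d"))
nrow(unmatched)  # how many rows were left unmatched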

Upvotes: 1

pseudospin

Reputation: 2777

I think this is just a join, so it should be almost instantaneous for a problem of this size. But something seems wrong with your toy data: df2 would have to be unique on columns b and d, otherwise your for loop is just repeatedly overwriting elements of df1.

library(data.table)
setDT(df1)
setDT(df2)
# update df1 by reference: look up each (y, w) pair in df2 and pull columns c and e
df1[, c('pay', 'ynm') := df2[df1, on = c('b' = 'y', 'd' = 'w'), .(c, e)]]
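
For reference, the same left join can also be sketched in base R with merge(), assuming fresh copies of the toy data frames (i.e., before the setDT() calls above convert them in place); note that merge() can reorder rows, so the original order is restored at the end:

df3 <- merge(df1, df2[, c("b", "c", "d", "e")],
             by.x = c("y", "w"), by.y = c("b", "d"),
             all.x = TRUE)               # all.x = TRUE keeps unmatched rows, filled with NA
names(df3)[names(df3) == "c"] <- "pay"   # rename the joined columns
names(df3)[names(df3) == "e"] <- "ynm"
df3 <- df3[order(df3$x), ]               # restore the original row order by x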

Upvotes: 1
