Tim Batten

Reputation: 55

Translate R for loop into apply function

I have written the following for loop in my code:

for (i in 2:nrow(ProductionWellYear2)) {
  if (ProductionWellYear2[i, ncol(ProductionWellYear2)] == 0) {
    ProductionWellYear2[i, ncol(ProductionWellYear2)] = ProductionWellYear2[i - 1, ncol(ProductionWellYear2)] + 1
  } else {
    ProductionWellYear2[i, ncol(ProductionWellYear2)] = ProductionWellYear2[i, ncol(ProductionWellYear2)]
  }
}

However, this is very slow because the data frame has over 800k rows. How can I make it faster and avoid the for loop?

Upvotes: 0

Views: 145

Answers (2)

Gaffi

Reputation: 4367

This should work for you, though without seeing your data I can't verify that the results are what you want. The process is not much different from the loop as originally written, but benchmarking on my example data does show it running faster, if not necessarily "fast".

library(microbenchmark)
# Create fake data
set.seed(1)
ProductionWellYear <- data.frame(A = as.integer(rnorm(2500)),
                                 B = as.integer(rnorm(2500)),
                                 C = as.integer(rnorm(2500))
)

# Copy it to confirm results of both processes are the same
ProductionWellYear2 <- ProductionWellYear


# Slightly modified original version
# Note: the assignments below change a function-local copy of
# ProductionWellYear; the global data frame is left untouched
method1 <- function() {
  cols <- ncol(ProductionWellYear)
  for(i in 2:nrow(ProductionWellYear)) {
    if (ProductionWellYear[i, cols] == 0) {
      ProductionWellYear[i, cols] = ProductionWellYear[i - 1, cols] +1
    }
    else {
      ProductionWellYear[i, cols] = ProductionWellYear[i, cols]
    }
  }
}

# New version
# Note: `<<-` writes to the global ProductionWellYear2, so repeated calls
# (as in a benchmark) start from already-modified data
method2 <- function() {
  cols <- ncol(ProductionWellYear2)
  sapply(2:nrow(ProductionWellYear2), function(i) {
    if (ProductionWellYear2[i, cols] == 0) {
      ProductionWellYear2[i, cols] <<- ProductionWellYear2[i - 1, cols] +1
    }
  })
}


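# Comparing the two data frames (neither method has been called yet at this
# point, so both are still identical copies of the fake data)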
all(ProductionWellYear == ProductionWellYear2)
#[1] TRUE

result <- microbenchmark(method1(), method2())
result
#Unit: milliseconds
#      expr      min       lq     mean   median       uq       max neval
#  method1() 151.78802 167.3932 190.14905 176.2855 197.60406 337.9904   100
#  method2()  45.56065  53.7744  67.55549  59.9299  72.81873 174.1417   100
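
For a like-for-like check of the two approaches, one option is to pass the data frame in as an argument and return the modified copy, so that neither version depends on global state and the outputs can be compared directly. The following is only a sketch under that assumption; the names `fill_loop`, `fill_sapply` and `test_df` are made up for illustration and do not appear in the code above, and base `system.time()` is swapped in for `microbenchmark` to avoid the extra package.

```r
# Hypothetical helpers (names are assumptions, not from the answer above).
# Both take a data frame and return a modified copy.
fill_loop <- function(df) {
  col <- ncol(df)
  for (i in 2:nrow(df)) {
    if (df[i, col] == 0) df[i, col] <- df[i - 1, col] + 1
  }
  df
}

fill_sapply <- function(df) {
  col <- ncol(df)
  # `<<-` here updates df in fill_sapply's own frame, not a global variable
  sapply(2:nrow(df), function(i) {
    if (df[i, col] == 0) df[i, col] <<- df[i - 1, col] + 1
  })
  df
}

# Same kind of fake data as above
set.seed(1)
test_df <- data.frame(A = as.integer(rnorm(2500)),
                      B = as.integer(rnorm(2500)),
                      C = as.integer(rnorm(2500)))

res_loop   <- fill_loop(test_df)
res_sapply <- fill_sapply(test_df)

# Compare the actual outputs of both functions
print(identical(res_loop, res_sapply))

# Each call starts from the same untouched input, so timings are comparable
print(system.time(fill_loop(test_df)))
print(system.time(fill_sapply(test_df)))
```

Because each call works on its own copy of the input, repeated timing runs all do the same amount of work.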

Upvotes: 0

jay.sf

Reputation: 72828

You could use conditional assignment, which takes advantage of R being a vectorized language.

Consider this initial data frame:

          X1          X2         X3 year
1  1.3709584 -0.09465904 -0.1333213 2014
2 -0.5646982  2.01842371  0.6359504    0
3  0.3631284 -0.06271410 -0.2842529 2016
4  0.6328626  1.30486965 -2.6564554    0
5  0.4042683  2.28664539 -2.4404669 2018
6 -0.1061245 -1.38886070  1.3201133    0
7  1.5115220 -0.27878877 -0.3066386 2020

Then do:

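num.col <- ncol(ProductionWellYear2)  # index of the last column, to keep code short

# Rows whose last column is 0 (assumes row 1 is non-zero and zeros are not consecutive)
zero.rows <- which(ProductionWellYear2[[num.col]] == 0)

ProductionWellYear2[zero.rows, num.col] <-
  ProductionWellYear2[zero.rows - 1, num.col] + 1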

Resulting data frame:

          X1          X2         X3 year
1  1.3709584 -0.09465904 -0.1333213 2014
2 -0.5646982  2.01842371  0.6359504 2015
3  0.3631284 -0.06271410 -0.2842529 2016
4  0.6328626  1.30486965 -2.6564554 2017
5  0.4042683  2.28664539 -2.4404669 2018
6 -0.1061245 -1.38886070  1.3201133 2019
7  1.5115220 -0.27878877 -0.3066386 2020

Data:

ProductionWellYear2 <- structure(list(X1 = c(1.37095844714667, -0.564698171396089, 0.363128411337339, 
0.63286260496104, 0.404268323140999, -0.106124516091484, 1.51152199743894
), X2 = c(-0.0946590384130976, 2.01842371387704, -0.062714099052421, 
1.30486965422349, 2.28664539270111, -1.38886070111234, -0.278788766817371
), X3 = c(-0.133321336393658, 0.635950398070074, -0.284252921416072, 
-2.65645542090478, -2.44046692857552, 1.32011334573019, -0.306638594078475
), year = c(2014, 0, 2016, 0, 2018, 0, 2020)), row.names = c(NA, 
-7L), class = "data.frame")
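
If the last column can contain several zeros in a row, the loop in the question keeps counting up from the last non-zero value, whereas the assignment above reads the original (still zero) previous row. A possible vectorized sketch for that case is below. The helper name `fill_zero_runs` and the example values are made up for illustration, and it assumes a numeric column with no NA values and at least one row.

```r
# Hypothetical helper (name is an assumption): vectorized equivalent of the
# loop in the question, including runs of consecutive zeros.
fill_zero_runs <- function(x) {
  idx <- seq_along(x)
  anchor <- x != 0     # positions holding a real (non-zero) value
  anchor[1] <- TRUE    # the loop starts at row 2, so row 1 is never changed
  # Index of the most recent anchor at or before each position
  last <- cummax(ifelse(anchor, idx, 0L))
  # Anchor value plus the number of rows since that anchor
  x[last] + (idx - last)
}

# Made-up example with runs of zeros
years <- c(2014, 0, 0, 2017, 0, 0, 0)
filled <- fill_zero_runs(years)
print(filled)

# Applied to the last column of a data frame
df <- data.frame(x = seq_len(7), year = years)
df[[ncol(df)]] <- fill_zero_runs(df[[ncol(df)]])
print(df)
```

The idea is that every zero ends up as the last non-zero value plus its distance from it, which is what repeatedly adding 1 in the loop produces, so no row-by-row iteration is needed.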

Upvotes: 2
