Reputation: 55
I have written the following for loop:
for (i in 2:nrow(ProductionWellYear2)) {
  if (ProductionWellYear2[i, ncol(ProductionWellYear2)] == 0) {
    ProductionWellYear2[i, ncol(ProductionWellYear2)] = ProductionWellYear2[i - 1, ncol(ProductionWellYear2)] + 1
  } else {
    ProductionWellYear2[i, ncol(ProductionWellYear2)] = ProductionWellYear2[i, ncol(ProductionWellYear2)]
  }
}
However, this is very slow because the data frame has over 800k rows. How can I speed it up and avoid the for loop?
Upvotes: 0
Views: 145
Reputation: 4367
This should work for you, but without seeing your data I can't verify that the results are what you want. The process is not much different from the original, but benchmarking on my example data suggests it is faster, though not necessarily "fast".
library(microbenchmark)
# Create fake data
set.seed(1)
ProductionWellYear <- data.frame(A = as.integer(rnorm(2500)),
                                 B = as.integer(rnorm(2500)),
                                 C = as.integer(rnorm(2500)))
# Copy it to confirm results of both processes are the same
ProductionWellYear2 <- ProductionWellYear
# Slightly modified original version
# (note: the assignments change a function-local copy, so the global
# ProductionWellYear is left untouched)
method1 <- function() {
  cols <- ncol(ProductionWellYear)
  for (i in 2:nrow(ProductionWellYear)) {
    if (ProductionWellYear[i, cols] == 0) {
      ProductionWellYear[i, cols] = ProductionWellYear[i - 1, cols] + 1
    } else {
      ProductionWellYear[i, cols] = ProductionWellYear[i, cols]
    }
  }
}
# New version (note: <<- writes to the global ProductionWellYear2)
method2 <- function() {
  cols <- ncol(ProductionWellYear2)
  sapply(2:nrow(ProductionWellYear2), function(i) {
    if (ProductionWellYear2[i, cols] == 0) {
      ProductionWellYear2[i, cols] <<- ProductionWellYear2[i - 1, cols] + 1
    }
  })
}
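# Comparing the outputs (as written, this runs before either function has
# been called, so it only confirms that the two copies start out identical)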
all(ProductionWellYear == ProductionWellYear2)
#[1] TRUE
result <- microbenchmark(method1(), method2())
result
#Unit: milliseconds
#      expr       min       lq      mean   median        uq      max neval
# method1() 151.78802 167.3932 190.14905 176.2855 197.60406 337.9904   100
# method2()  45.56065  53.7744  67.55549  59.9299  72.81873 174.1417   100
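Much of the cost in both versions above comes from indexing the data frame on every iteration. As a rough sketch (not benchmarked here), the same loop can run over a plain vector: pull the last column out once, update it, and write it back. The names `fill_zeros_loop` and `df` are made up for illustration, and the sketch assumes the last column is numeric with no `NA` values.

```r
# Hypothetical helper: same logic as the original loop, but run on a plain
# vector instead of indexing the data frame on every iteration.
fill_zeros_loop <- function(df) {
  last <- ncol(df)
  x <- df[[last]]                 # extract the last column once
  for (i in seq_along(x)[-1]) {   # rows 2..n; empty when there are fewer than 2 rows
    if (x[i] == 0) {
      x[i] <- x[i - 1] + 1        # uses the already-updated previous value
    }
  }
  df[[last]] <- x                 # write the column back in one step
  df                              # return the modified copy
}

# Small made-up example
d <- data.frame(A = c(5, 6, 7, 8), year = c(2014, 0, 0, 2020))
fill_zeros_loop(d)$year
# 2014 2015 2016 2020
```

Because the function returns the modified data frame instead of relying on `<<-`, the result has to be assigned back, e.g. `ProductionWellYear2 <- fill_zeros_loop(ProductionWellYear2)`.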
Upvotes: 0
Reputation: 72828
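You could use conditional assignment, which takes advantage of R being a vectorized language. This works when every zero is directly preceded by a nonzero value, as in the example below.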
Consider this initial data frame:
          X1          X2         X3 year
1  1.3709584 -0.09465904 -0.1333213 2014
2 -0.5646982  2.01842371  0.6359504    0
3  0.3631284 -0.06271410 -0.2842529 2016
4  0.6328626  1.30486965 -2.6564554    0
5  0.4042683  2.28664539 -2.4404669 2018
6 -0.1061245 -1.38886070  1.3201133    0
7  1.5115220 -0.27878877 -0.3066386 2020
Then do:
num.col <- ncol(ProductionWellYear2)  # to keep code short
ProductionWellYear2[ProductionWellYear2[num.col] == 0, num.col] <-
  ProductionWellYear2[which(ProductionWellYear2[num.col] == 0) - 1, num.col] + 1
Resulting data frame:
          X1          X2         X3 year
1  1.3709584 -0.09465904 -0.1333213 2014
2 -0.5646982  2.01842371  0.6359504 2015
3  0.3631284 -0.06271410 -0.2842529 2016
4  0.6328626  1.30486965 -2.6564554 2017
5  0.4042683  2.28664539 -2.4404669 2018
6 -0.1061245 -1.38886070  1.3201133 2019
7  1.5115220 -0.27878877 -0.3066386 2020
Data:
ProductionWellYear2 <- structure(list(X1 = c(1.37095844714667, -0.564698171396089, 0.363128411337339,
0.63286260496104, 0.404268323140999, -0.106124516091484, 1.51152199743894
), X2 = c(-0.0946590384130976, 2.01842371387704, -0.062714099052421,
1.30486965422349, 2.28664539270111, -1.38886070111234, -0.278788766817371
), X3 = c(-0.133321336393658, 0.635950398070074, -0.284252921416072,
-2.65645542090478, -2.44046692857552, 1.32011334573019, -0.306638594078475
), year = c(2014, 0, 2016, 0, 2018, 0, 2020)), row.names = c(NA,
-7L), class = "data.frame")
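The conditional assignment above looks only one row back, so it matches the original loop only when zeros are isolated. If the column can contain runs of consecutive zeros, one possible vectorized sketch is to find, for each row, the most recent row that the loop would leave unchanged (a nonzero value, or row 1) and add the distance to it. The function name `fill_zeros` is made up for illustration, and the sketch assumes a numeric vector with no `NA` values.

```r
# Hypothetical helper: vectorized equivalent of the loop
#   if (x[i] == 0) x[i] <- x[i - 1] + 1   for i = 2..n
# that also handles runs of consecutive zeros.
fill_zeros <- function(x) {
  n <- length(x)
  if (n < 2) return(x)
  idx <- seq_len(n)
  is_anchor <- x != 0        # values the loop leaves unchanged
  is_anchor[1] <- TRUE       # the loop never touches row 1
  # position of the most recent anchor at or before each row
  last_anchor <- cummax(ifelse(is_anchor, idx, 0L))
  # anchor value plus the number of rows since that anchor
  x[last_anchor] + (idx - last_anchor)
}

fill_zeros(c(2014, 0, 0, 2016, 0))
# 2014 2015 2016 2016 2017
```

Applied to the data frame from the question, this would be `ProductionWellYear2[[num.col]] <- fill_zeros(ProductionWellYear2[[num.col]])`. It uses no explicit loop, so it should scale to 800k rows, though that has not been timed here.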
Upvotes: 2