user3969377

efficient loop in R

The data looks like

   cum_ft source
 125.4585   imds 
 125.4585   imds 
 125.4585   imds 
 125.4585   imds 
 125.4585   imds 
 125.4585   imds
 123.1018   imds 
 125.4585   imds 
 125.4585   imds 
 125.4585   imds 
   0.0000  maint 
   0.0000  maint 
   0.0000  maint 
 126.7622   imds 
 126.7622   imds 
 126.7622   imds 

The goal is to set the cum_ft value for the maint rows to the last value from imds

   cum_ft source 
 123.1018   imds 
 125.4585   imds 
 125.4585   imds 
 125.4585   imds 
 125.4585   imds 
 125.4585   imds 
 125.4585   imds 
 125.4585   imds 
 125.4585   imds 
 125.4585   imds 
 125.4585  maint 
 125.4585  maint 
 125.4585  maint 
 126.7622   imds 
 126.7622   imds 
 126.7622   imds 

I'm trying, without success, something like

# row indices of the maint rows (to fix) and the imds rows (that hold the data)
maint_rows_to_change = which(temp_df$source=="maint")
diff_maint_row_to_change = diff(maint_rows_to_change)
imds_rows_with_data = which(temp_df$source=="imds")
diff_imds_row_to_change = diff(imds_rows_with_data)
rows_to_change_increment = which(diff_maint_row_to_change > 1)

At this point, diff_maint_row_to_change has values greater than one where there are imds rows to skip over, and values of one where there are consecutive maintenance rows that have to be adjusted. The adjustment is to set cum_ft for the maintenance rows to the last value from the imds data.

What I would like to write is something like the expression below, but I'm not clear on how to come up with last_imds_row. In this example, maint_rows_to_change = c(11,12,13) and last_imds_row = c(10,10,10).

temp_df$cum_ft[maint_rows_to_change] = temp_df$cum_ft[last_imds_row]
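
One idea I have not been able to verify yet (just a sketch, and it assumes the first row of the data is an imds row) is to carry the index of the most recent imds row forward with cummax and use that as last_imds_row:

# sketch: carry forward the index of the most recent imds row,
# then look up cum_ft at those indices for the maint rows
row_idx = seq_len(nrow(temp_df))
row_idx[temp_df$source != "imds"] = 0   # zero out non-imds positions
last_imds_row = cummax(row_idx)[maint_rows_to_change]
temp_df$cum_ft[maint_rows_to_change] = temp_df$cum_ft[last_imds_row]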

I also tried a loop, with some success, but it takes too long

fun1 <- function(z) {
  z$cum_ft_cor = z$cum_ft
  rows_to_fix = which(z$source=="maint")
  z$cum_ft_cor[rows_to_fix] = -1          # flag the rows that still need a value
  for(i in rows_to_fix) {
    z$cum_ft_cor[i] <- z$cum_ft_cor[i-1]  # copy the previous (already filled) value
  }
  z
}
temp_df_2 = fun1(temp_df)
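
One variant of the loop I have been meaning to try (sketch only; I have not timed it, but my assumption is that indexing a plain vector inside the loop is cheaper than indexing a data-frame column on every iteration):

fun2 <- function(z) {
  v = z$cum_ft                        # work on an atomic vector, not a column
  for(i in which(z$source=="maint")) {
    v[i] = v[i-1]                     # previous value is already filled in
  }
  z$cum_ft_cor = v
  z
}
temp_df_3 = fun2(temp_df)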

Upvotes: 4

Views: 189

Answers (1)

josliber

Reputation: 44340

One option is to make a looping solution faster with the Rcpp package:

library(Rcpp)
# copyDat modifies x in place: every "maint" row inherits the previous row's value
copyDat <- cppFunction(
'void copyDat(NumericVector x, std::vector<std::string> y) {
  for (int i=1; i < y.size(); ++i) {
    if (y[i] == "maint") x[i] = x[i-1];
  }
}')

Then you could do:

copyDat(temp_df$cum_ft, as.character(temp_df$source))
temp_df
#      cum_ft source
# 1  125.4585   imds
# 2  125.4585   imds
# 3  125.4585   imds
# 4  125.4585   imds
# 5  125.4585   imds
# 6  125.4585   imds
# 7  123.1018   imds
# 8  125.4585   imds
# 9  125.4585   imds
# 10 125.4585   imds
# 11 125.4585  maint
# 12 125.4585  maint
# 13 125.4585  maint
# 14 126.7622   imds
# 15 126.7622   imds
# 16 126.7622   imds

On an example with 1.3 million rows, the Rcpp solution is about 6x faster than the zoo solution posted in the comments (though both are quite quick):

# Functions to benchmark
josilber <- function(temp_df) {
  copyDat(temp_df$cum_ft, as.character(temp_df$source))
  temp_df
}
library(zoo)
darenburg <- function(temp_df) {
  temp_df[temp_df$source == "maint", "cum_ft"] <- NA
  temp_df$cum_ft <- na.locf(temp_df$cum_ft)
  temp_df
}

# Do the test
library(microbenchmark)
temp_df <- data.frame(cum_ft=rnorm(1300000),
                      source=rep(c(rep("imds", 10), rep("maint", 3)), 100000))
all.equal(josilber(temp_df), darenburg(temp_df))
# [1] TRUE
microbenchmark(josilber(temp_df), darenburg(temp_df))
# Unit: milliseconds
#                expr       min        lq    median        uq      max neval
#   josilber(temp_df)  78.05012  83.80206  86.96831  92.56959 122.5809   100
#  darenburg(temp_df) 464.33525 492.76668 510.65864 541.43435 703.6944   100

Upvotes: 2
