Reputation:
The data looks like
cum_ft source
125.4585 imds
125.4585 imds
125.4585 imds
125.4585 imds
125.4585 imds
125.4585 imds
123.1018 imds
125.4585 imds
125.4585 imds
125.4585 imds
0.0000 maint
0.0000 maint
0.0000 maint
126.7622 imds
126.7622 imds
126.7622 imds
The goal is set the value for maint to the last value from imds
cum_ft source
123.1018 imds
125.4585 imds
125.4585 imds
125.4585 imds
125.4585 imds
125.4585 imds
125.4585 imds
125.4585 imds
125.4585 imds
125.4585 imds
125.4585 maint
125.4585 maint
125.4585 maint
126.7622 imds
126.7622 imds
126.7622 imds
I'm trying, without success, something like
maint_rows_to_change = which(temp_df$source=="maint")
diff_maint_row_to_change = diff(maint_rows_to_change)
imds_rows_with_data = which(temp_df$source=="imds")
diff_imds_row_to_change = diff(imds_rows_with_data)
rows_to_change_increment = which(diff_update_row > 1)
At this point, diff_maint_row_to_change has numbers greater than one when there are imsl data to skip over, and values of one when there are consecutive maintenance rows that have to be adjusted. The adjustment is to set the value of cum_ft for the maintenance rows to the last value from the imsl data.
What I would like to write is something like the expression below, but I'm not clear on how to come up with last_imds_row. In this example, maint_rows_to_change = c(11,12,13) and last_imds_row = c(10,10,10).
temp_df$cum_ft[maint_rows_to_change] = temp_df$cum_ft[last_imds_row]
I also tried a loop, with some success, but takes too long
fun1 <- function(z) {
z$cum_ft_cor = z$cum_ft
rows_to_fix = which(z$source=="maint")
z$cum_ft_cor[rows_to_fix]=-1
for(i in rows_to_fix) {
z$cum_ft_cor[i] <- z$cum_ft_cor[i-1]
}
z
}
temp_df_2 = fun1(temp_df)
Upvotes: 4
Views: 189
Reputation: 44340
One option is to make a looping solution faster with the Rcpp package:
library(Rcpp)
copyDat <- cppFunction(
'void copyDat(NumericVector x, std::vector<std::string> y) {
for (int i=1; i < y.size(); ++i) {
if (y[i] == "maint") x[i] = x[i-1];
}
}')
Then you could do:
copyDat(temp_df$cum_ft, as.character(temp_df$source))
temp_df
# cum_ft source
# 1 125.4585 imds
# 2 125.4585 imds
# 3 125.4585 imds
# 4 125.4585 imds
# 5 125.4585 imds
# 6 125.4585 imds
# 7 123.1018 imds
# 8 125.4585 imds
# 9 125.4585 imds
# 10 125.4585 imds
# 11 125.4585 maint
# 12 125.4585 maint
# 13 125.4585 maint
# 14 126.7622 imds
# 15 126.7622 imds
# 16 126.7622 imds
On an example with 1.3 million rows, the Rcpp solution is about 6x faster than the zoo solution posted in the comments (though both are quite quick):
# Functions to benchmark
josilber <- function(temp_df) {
copyDat(temp_df$cum_ft, as.character(temp_df$source))
temp_df
}
library(zoo)
darenburg <- function(temp_df) {
temp_df[temp_df$source == "maint", "cum_ft"] <- NA
temp_df$cum_ft <- na.locf(temp_df$cum_ft)
temp_df
}
# Do the test
library(microbenchmark)
temp_df <- data.frame(cum_ft=rnorm(1300000),
source=rep(c(rep("imds", 10), rep("maint", 3)), 100000))
all.equal(josilber(temp_df), darenburg(temp_df))
# [1] TRUE
microbenchmark(josilber(temp_df), darenburg(temp_df))
# Unit: milliseconds
# expr min lq median uq max neval
# josilber(temp_df) 78.05012 83.80206 86.96831 92.56959 122.5809 100
# darenburg(temp_df) 464.33525 492.76668 510.65864 541.43435 703.6944 100
Upvotes: 2