I have a table of historical daily price data for multiple products, with NA for future dates. Another column holds the expected price-increase factor for a product on a given future date; the increase is applied to the prior day's price.
I have written a for-loop to calculate the expected prices, but it runs very slowly over the ~500,000 records it loops through. All historical prices are populated, while every projected price starts out as NA.
Example of current table (old_table):
date       product   price   incr_amt
======================================
...        ...       ...     ...
10/14/19   prod1     50      1.0
10/15/19   prod1     50      1.0
10/16/19   prod1     NA      1.0
...        ...       ...     ...
04/01/20   prod1     NA      1.05
04/02/20   prod1     NA      1.0
...        ...       ...     ...
10/14/19   prod2     35      1.0
10/15/19   prod2     35      1.0
10/16/19   prod2     NA      1.0
...        ...       ...     ...
01/01/20   prod2     NA      1.02
01/02/20   prod2     NA      1.0
...        ...       ...     ...
My current code groups by product; then, for each row where price is NA, it computes the price as the lagged price * incr_amt and recalculates the lagged price for the next iteration, looping until every row in the table has been processed.
Example result (new_table):
date       product   price   incr_amt
======================================
...        ...       ...     ...
10/14/19   prod1     50      1.0
10/15/19   prod1     50      1.0
10/16/19   prod1     50      1.0
...        ...       ...     ...
04/01/20   prod1     52.5    1.05
04/02/20   prod1     52.5    1.0
...        ...       ...     ...
10/14/19   prod2     35      1.0
10/15/19   prod2     35      1.0
10/16/19   prod2     35      1.0
...        ...       ...     ...
01/01/20   prod2     35.7    1.02
01/02/20   prod2     35.7    1.0
...        ...       ...     ...
My current code works, but it takes over an hour to run. Because each iteration depends on the previous one and order matters, I don't know whether there is a way around using a loop.
Current Code:
library(tidyverse)

old_table <- tribble(
  ~date,        ~product, ~price, ~incr_amt,
  "2019-10-14", "prod1",  50,     1.0,
  "2019-10-15", "prod1",  50,     1.0,
  "2019-10-16", "prod1",  NA,     1.0,
  "2019-10-17", "prod1",  NA,     1.0,
  "2019-10-18", "prod1",  NA,     1.0,
  "2019-10-19", "prod1",  NA,     1.05,
  "2019-10-20", "prod1",  NA,     1.0,
  "2019-10-21", "prod1",  NA,     1.0,
  "2019-10-14", "prod2",  35,     1.0,
  "2019-10-15", "prod2",  35,     1.0,
  "2019-10-16", "prod2",  NA,     1.0,
  "2019-10-17", "prod2",  NA,     1.0,
  "2019-10-18", "prod2",  NA,     1.0,
  "2019-10-19", "prod2",  NA,     1.0,
  "2019-10-20", "prod2",  NA,     1.0,
  "2019-10-21", "prod2",  NA,     1.02,
  "2019-10-22", "prod2",  NA,     1.0
)

# Seed the lagged price within each product
new_table <- old_table %>%
  group_by(product) %>%
  mutate(lag_price = lag(price))

# Fill each NA price from the prior row's price, one row at a time
for (i in 1:nrow(new_table)) {
  if (!is.na(new_table$price[[i]]))
    next
  new_table$price[[i]] <- new_table$lag_price[[i]] * new_table$incr_amt[[i]]
  # Rebuild the entire lag column after every single fill --
  # this whole-table modification on each iteration is the bottleneck
  new_table$lag_price <- lag(new_table$price)
}
How can I improve this process? Thanks.
Answer:
Here's a vectorized solution that I expect will be much faster. (I'd be curious how much faster it is on your real data.) The main thing slowing your code down is, as @aocall notes, the 500,000 table modifications; applying the same calculation to the whole table at once should be much faster.

Here, we calculate the cumulative growth across each missing section within each product. (We also unnecessarily calculate the growth across the non-missing sections, but I assume that overhead is minimal.) Then we apply that growth factor to the last available price to get each filled-in value.
library(dplyr)

new_table2 <- old_table %>%
  # Put together streaks of missingness & track cumulative growth in them
  group_by(product) %>%
  mutate(missing_streak = cumsum(is.na(price) != is.na(lag(price)))) %>%
  # Fill in NA with the last observed value
  mutate(price_new = price) %>%
  tidyr::fill(price_new) %>%
  # Gross up based on cumulative growth within each streak
  group_by(product, missing_streak) %>%
  mutate(cuml_growth = cumprod(incr_amt)) %>%
  mutate(price_new = if_else(is.na(price),
                             price_new * cuml_growth,
                             price)) %>%
  ungroup()
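To make the mechanics concrete, here are the intermediate columns the pipeline produces for prod1, worked out by hand from the sample data above (rows 3-8 form one missing streak, so their growth factors compound off the last observed price of 50):

date        price  missing_streak  price_new(filled)  cuml_growth  price_new(final)
2019-10-14   50          1               50               1.00          50
2019-10-15   50          1               50               1.00          50
2019-10-16   NA          2               50               1.00          50
2019-10-17   NA          2               50               1.00          50
2019-10-18   NA          2               50               1.00          50
2019-10-19   NA          2               50               1.05          52.5
2019-10-20   NA          2               50               1.05          52.5
2019-10-21   NA          2               50               1.05          52.5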
Seems to work on your data:
identical(new_table$price, new_table2$price_new)
[1] TRUE
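If you want to quantify the speedup on your real data, the bench package gives a quick comparison. A minimal sketch, assuming you wrap each approach in a function first (run_loop() and run_vectorized() are hypothetical wrappers, not functions defined above):

library(bench)
# Hypothetical wrappers: run_loop() should contain the original for-loop,
# run_vectorized() the dplyr pipeline above; both take and return a table.
bench::mark(
  loop       = run_loop(old_table),
  vectorized = run_vectorized(old_table),
  check = FALSE  # the outputs use different column names, so skip the equality check
)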