Blake Lynch

Reputation: 1

How to speed up a for-loop when order of execution matters?

I have a table containing historical daily price data for different products, with NA for future prices. I also have a column for the expected price increase on a given future date for each product. Each increase is applied to the prior day's price.

I have constructed a for-loop to calculate the expected price for the products, but it runs very slowly for the ~500,000 records it is looping through.

All historical prices are present in the table, while all projected prices are NA.

Example of current table (old_table):

date        product        price        incr_amt
====================================================
...          ...            ...         ...
10/14/19     prod1          50          1.0
10/15/19     prod1          50          1.0
10/16/19     prod1          NA          1.0
...          ...            ...         ...
04/01/20     prod1          NA          1.05
04/02/20     prod1          NA          1.0
...          ...            ...         ...
...          ...            ...         ...
10/14/19     prod2          35          1.0
10/15/19     prod2          35          1.0
10/16/19     prod2          NA          1.0
...          ...            ...         ...
01/01/20     prod2          NA          1.02
01/02/20     prod2          NA          1.0
...          ...            ...         ...

My current code groups by product; then, wherever price is NA, it calculates the price as the lagged price times incr_amt and recalculates the lagged price for the next iteration, repeating until every row in the table has been processed.

Example result (new_table):

date        product        price        incr_amt
====================================================
...          ...            ...         ...
10/14/19     prod1          50          1.0
10/15/19     prod1          50          1.0
10/16/19     prod1          50          1.0
...          ...            ...         ...
04/01/20     prod1          52.5        1.05
04/02/20     prod1          52.5        1.0
...          ...            ...         ...
...          ...            ...         ...
10/14/19     prod2          35          1.0
10/15/19     prod2          35          1.0
10/16/19     prod2          35          1.0
...          ...            ...         ...
01/01/20     prod2          35.7        1.02
01/02/20     prod2          35.7        1.0
...          ...            ...         ...

My current code works, but it takes over an hour to run. Because each iteration depends on the previous one and order matters, I don't know if there is a workaround that avoids the loop.

Current Code:

library(tidyverse)

old_table <- tribble(
  ~date, ~product, ~price, ~incr_amt,
  "2019-10-14", "prod1", 50, 1.0,
  "2019-10-15", "prod1", 50, 1.0,
  "2019-10-16", "prod1", NA, 1.0,
  "2019-10-17", "prod1", NA, 1.0,
  "2019-10-18", "prod1", NA, 1.0,
  "2019-10-19", "prod1", NA, 1.05,
  "2019-10-20", "prod1", NA, 1.0,
  "2019-10-21", "prod1", NA, 1.0,
  "2019-10-14", "prod2", 35, 1.0,
  "2019-10-15", "prod2", 35, 1.0,
  "2019-10-16", "prod2", NA, 1.0,
  "2019-10-17", "prod2", NA, 1.0,
  "2019-10-18", "prod2", NA, 1.0,
  "2019-10-19", "prod2", NA, 1.0,
  "2019-10-20", "prod2", NA, 1.0,
  "2019-10-21", "prod2", NA, 1.02,
  "2019-10-22", "prod2", NA, 1.0
)

new_table <- old_table %>%
  group_by(product) %>%
  mutate(lag_price = lag(price))  # prior day's price within each product

for (i in 1:nrow(new_table)) {
  # rows with an observed price are left as-is
  if (!is.na(new_table$price[[i]])) next
  # fill in the projected price from the prior day's price
  new_table$price[[i]] <- new_table$lag_price[[i]] * new_table$incr_amt[[i]]
  # recompute the whole lag column so the next iteration sees the new value
  # (this lag ignores the product grouping, but each product's first rows
  # have observed prices here, so the boundary value is never used)
  new_table$lag_price <- lag(new_table$price)
}

The code runs, but takes over an hour to loop through the ~500,000 records. How can I improve this process? Thanks.

Upvotes: 0

Views: 168

Answers (1)

Jon Spring

Reputation: 66510

Here's a vectorized solution that I expect will be much faster. (I'd be curious how much faster it is on your real data.) The main thing slowing your code down is, as @aocall notes, the 500,000 modifications of the table; it should be much faster if we can apply the same calculations to the whole table at once.

Here, we calculate the cumulative growth across each run of missing prices within each product. (We also unnecessarily calculate the growth across non-missing runs, but I assume the overhead will be minimal.) Then we apply that growth factor to the last available price to get the filled-in one.

library(dplyr)
new_table2 <- old_table %>%
  # Put together strings of missingness & track cumulative growth in them
  group_by(product) %>%
  mutate(missing_streak = cumsum(is.na(price) != is.na(lag(price)))) %>%

  # fill in NA with last value
  mutate(price_new = price) %>%
  tidyr::fill(price_new) %>%

  # gross up based on growth
  group_by(product, missing_streak) %>%
  mutate(cuml_growth = cumprod(incr_amt)) %>%
  mutate(price_new = if_else(is.na(price),
                             price_new * cuml_growth,
                             price)) %>%
  ungroup()
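
To spell out why the cumprod works: within each run of NAs, the recurrence price[i] = price[i-1] * incr_amt[i] telescopes to the last observed price times the running product of incr_amt over the run. For prod1, the NA streak starting 2019-10-16 has cumulative growth 1, 1, 1, 1.05, 1.05, 1.05, so the filled prices are 50, 50, 50, then 50 * 1.05 = 52.5 from 2019-10-19 onward, matching the expected output.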

Seems to work on your data:

identical(new_table$price, new_table2$price_new)
[1] TRUE
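
If you want to quantify the speedup before committing, here's a rough benchmark sketch. The table construction, product count, and day count below are my own assumptions chosen to hit roughly 500,000 rows, not anything from the question; it just wraps the pipeline above in a function and times it with system.time().

library(dplyr)
library(tidyr)

# Synthetic data: 1,000 products x 500 days = 500,000 rows,
# with the first two days of each product observed and the rest NA
set.seed(42)
big_table <- expand_grid(
  product = paste0("prod", 1:1000),
  date = seq.Date(as.Date("2019-10-14"), by = "day", length.out = 500)
) %>%
  group_by(product) %>%
  mutate(
    price = if_else(row_number() <= 2, 50, NA_real_),
    incr_amt = sample(c(1, 1.02, 1.05), n(), replace = TRUE)
  ) %>%
  ungroup()

# the pipeline above, wrapped as a function
fill_vectorized <- function(df) {
  df %>%
    group_by(product) %>%
    mutate(missing_streak = cumsum(is.na(price) != is.na(lag(price)))) %>%
    mutate(price_new = price) %>%
    fill(price_new) %>%
    group_by(product, missing_streak) %>%
    mutate(cuml_growth = cumprod(incr_amt)) %>%
    mutate(price_new = if_else(is.na(price), price_new * cuml_growth, price)) %>%
    ungroup()
}

system.time(fill_vectorized(big_table))

Timing the original loop on the full table would take the hour the question describes, so for a head-to-head comparison, run both versions on a small subset (say, 20 products) and extrapolate from there.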

Upvotes: 1
