Reputation: 1634
I am trying to perform an iterative calculation on grouped data that depend on two previous elements within a group. As a toy example:
library(data.table)
library(dplyr)

set.seed(100)
df = data.table(ID = rep("A_index1", 9),
                Year = 2001:2009,
                Price = c(NA, NA, 10, NA, NA, 15, NA, 13, NA),
                Index = sample(seq(1, 3, by = 0.5), size = 9, replace = TRUE))
R> df
         ID Year Price Index
1: A_index1 2001 NA 1.5
2: A_index1 2002 NA 1.5
3: A_index1 2003 10 2.0
4: A_index1 2004 NA 1.0
5: A_index1 2005 NA 2.0
6: A_index1 2006 15 2.0
7: A_index1 2007 NA 3.0
8: A_index1 2008 13 1.5
9: A_index1 2009 NA 2.0
The objective is to fill the missing prices using the last available price, adjusted by the index. I have a loop that performs these calculations, which I am trying to vectorize using dplyr.
My logic is defined in the loop below:
df$Price_adj = df$Price
for (i in 2:nrow(df)) {
  if (is.na(df$Price[i])) {
    # carry the last adjusted price forward, scaled by the ratio of indices
    df$Price_adj[i] = round(df$Price_adj[i-1] * df$Index[i] / df$Index[i-1], 2)
  }
}
R> df
ID Year Price Index Price_adj
1: A_index1 2001 NA 1.5 NA
2: A_index1 2002 NA 1.5 NA
3: A_index1 2003 10 2.0 10.00
4: A_index1 2004 NA 1.0 5.00
5: A_index1 2005 NA 2.0 10.00
6: A_index1 2006 15 2.0 15.00
7: A_index1 2007 NA 3.0 22.50
8: A_index1 2008 13 1.5 13.00
9: A_index1 2009 NA 2.0 17.33
In my actual data I will have to apply this function to many groups, and speed is a consideration. My attempt at this is below; it needs help to point me in the right direction. I did consider Reduce, but I am not sure how it can incorporate two previous elements within a group (a rough sketch of that idea follows my attempt).
foo = function(Price, Index){
  for (i in 2:nrow(df)) {
    if (is.na(df$Price[i])) {
      df$Price_adj[i] = df$Price_adj[i-1] * df$Index[i] / df$Index[i-1]
    }
  }
}
df %>%
  group_by(ID) %>%
  mutate(Price_adj = Price,
         Price_adj = foo(Price, Index))
Upvotes: 2
Views: 823
Reputation: 215117
One option with cumprod:
df %>%
  # group data frame into chunks starting from a non-NA price
  group_by(ID, g = cumsum(!is.na(Price))) %>%
  # for each chunk, multiply the first non-NA price by the cumprod of Index[i] / Index[i-1]
  mutate(Price_adj = round(first(Price) * cumprod(Index / lag(Index, default = first(Index))), 2)) %>%
  ungroup() %>%
  select(-g)
# A tibble: 9 x 5
# ID Year Price Index Price_adj
# <fctr> <int> <dbl> <dbl> <dbl>
#1 A_index1 2001 NA 1.5 NA
#2 A_index1 2002 NA 1.5 NA
#3 A_index1 2003 10 2.0 10.00
#4 A_index1 2004 NA 1.0 5.00
#5 A_index1 2005 NA 2.0 10.00
#6 A_index1 2006 15 2.0 15.00
#7 A_index1 2007 NA 3.0 22.50
#8 A_index1 2008 13 1.5 13.00
#9 A_index1 2009 NA 2.0 17.33
Group the data frame by ID and cumsum(!is.na(Price)); the latter splits the data frame into chunks, and each chunk starts with a non-NA Price. Then first(Price) * cumprod(Index / lag(Index, default = first(Index))) does the iterative calculation: it is equivalent to the formula given in the question if you keep substituting Price_adj[i-1] with the expression for Price_adj[i-2], and so on, until you reach Price_adj[1], i.e. first(Price); the intermediate index ratios telescope into a cumulative product.
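As a quick illustration of the telescoping, take the chunk that starts at the 2003 price of 10, whose indices are 2.0, 1.0, 2.0 in the sample data; the cumulative product of consecutive index ratios reproduces rows 3-5 of the loop output:
idx <- c(2.0, 1.0, 2.0)
10 * cumprod(idx / dplyr::lag(idx, default = idx[1]))
# [1] 10  5 10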
Caveat: may not be very efficient if you have many NA chunks.
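Since the question already builds df with data.table, here is a minimal sketch of the same chunked-cumprod idea in data.table syntax (same logic, with shift() playing the role of lag(); untested beyond the toy data):
library(data.table)
dt <- as.data.table(df)
# g marks chunks that each start at a non-NA Price
dt[, g := cumsum(!is.na(Price)), by = ID]
dt[, Price_adj := round(first(Price) * cumprod(Index / shift(Index, fill = first(Index))), 2),
   by = .(ID, g)]
dt[, g := NULL]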
If speed is the primary concern, you can write the function with the Rcpp package:
library(Rcpp)
cppFunction("
NumericVector price_adj(NumericVector price, NumericVector index) {
int n = price.size();
NumericVector adjusted_price(n);
adjusted_price[0] = price[0];
for (int i = 1; i < n; i++) {
if(NumericVector::is_na(price[i])) {
adjusted_price[i] = adjusted_price[i-1] * index[i] / index[i-1];
} else {
adjusted_price[i] = price[i];
}
}
return adjusted_price;
}")
Now use the cpp function with dplyr as follows:
cpp_fun <- function() df %>% group_by(ID) %>% mutate(Price_adj = round(price_adj(Price, Index), 2))
cpp_fun()
# A tibble: 9 x 5
# Groups: ID [1]
# ID Year Price Index Price_adj
# <fctr> <int> <dbl> <dbl> <dbl>
#1 A_index1 2001 NA 1.5 NA
#2 A_index1 2002 NA 1.5 NA
#3 A_index1 2003 10 2.0 10.00
#4 A_index1 2004 NA 1.0 5.00
#5 A_index1 2005 NA 2.0 10.00
#6 A_index1 2006 15 2.0 15.00
#7 A_index1 2007 NA 3.0 22.50
#8 A_index1 2008 13 1.5 13.00
#9 A_index1 2009 NA 2.0 17.33
Benchmark:
Define r_fun as:
r_fun <- function() df %>% group_by(ID, g = cumsum(!is.na(Price))) %>% mutate(Price_adj = round(first(Price) * cumprod(Index / lag(Index, default=first(Index))), 2)) %>% ungroup() %>% select(-g)
On the small sample data, there's already a difference:
microbenchmark::microbenchmark(r_fun(), cpp_fun())
#Unit: milliseconds
# expr min lq mean median uq max neval
# r_fun() 10.127839 10.500281 12.627831 11.148093 12.686662 101.466975 100
# cpp_fun() 3.191278 3.308758 3.738809 3.491495 3.937006 6.627019 100
Testing on a slightly larger data frame:
df <- bind_rows(rep(list(df), 10000))
#dim(df)
#[1] 90000 4
microbenchmark::microbenchmark(r_fun(), cpp_fun(), times = 10)
#Unit: milliseconds
# expr min lq mean median uq max neval
# r_fun() 842.706134 890.978575 904.70863 908.77042 921.89828 986.44576 10
# cpp_fun() 8.722794 8.888667 10.67781 10.86399 12.10647 13.68302 10
Identity test:
identical(ungroup(r_fun()), ungroup(cpp_fun()))
# [1] TRUE
Upvotes: 3