Reputation: 1634
I am trying to perform an iterative calculation on grouped data that depend on two previous elements within a group. As a toy example:
library(data.table)
library(dplyr)

set.seed(100)
df = data.table(ID = rep("A_index1", 9),
                Year = 2001:2009,
                Price = c(NA, NA, 10, NA, NA, 15, NA, 13, NA),
                Index = sample(seq(1, 3, by = 0.5), size = 9, replace = TRUE))
R> df
         ID Year Price Index
1: A_index1 2001 NA 1.5
2: A_index1 2002 NA 1.5
3: A_index1 2003 10 2.0
4: A_index1 2004 NA 1.0
5: A_index1 2005 NA 2.0
6: A_index1 2006 15 2.0
7: A_index1 2007 NA 3.0
8: A_index1 2008 13 1.5
9: A_index1 2009 NA 2.0
The objective is to fill the missing prices using the last available price, adjusted by the index. I have a loop that performs these calculations, which I am trying to vectorize using dplyr.
My logic is defined in the loop below:
df$Price_adj = df$Price
for (i in 2:nrow(df)) {
  if (is.na(df$Price[i])) {
    # carry the last adjusted price forward, scaled by the ratio of indices
    df$Price_adj[i] = round(df$Price_adj[i-1] * df$Index[i] / df$Index[i-1], 2)
  }
}
R> df
ID Year Price Index Price_adj
1: A_index1 2001 NA 1.5 NA
2: A_index1 2002 NA 1.5 NA
3: A_index1 2003 10 2.0 10.00
4: A_index1 2004 NA 1.0 5.00
5: A_index1 2005 NA 2.0 10.00
6: A_index1 2006 15 2.0 15.00
7: A_index1 2007 NA 3.0 22.50
8: A_index1 2008 13 1.5 13.00
9: A_index1 2009 NA 2.0 17.33
In my actual data I will have to apply this function to many groups, and speed is a consideration. My attempt at this is below; it needs help to point me in the right direction. I did consider Reduce, but I am not sure how it can incorporate two previous elements within a group (a rough sketch of that idea follows my attempt).
foo = function(Price, Index){
  for (i in 2:nrow(df)) {
    if (is.na(df$Price[i])) {
      df$Price_adj[i] = df$Price_adj[i-1] * df$Index[i] / df$Index[i-1]
    }
  }
}
df %>%
  group_by(ID) %>%
  mutate(Price_adj = Price,
         Price_adj = foo(Price, Index))
Upvotes: 2
Views: 823
Reputation: 215117
One option with cumprod:
df %>%
  # group data frame into chunks starting from a non-NA price
  group_by(ID, g = cumsum(!is.na(Price))) %>%
  # for each chunk, multiply the first non-NA price by the cumprod of Index[i] / Index[i-1]
  mutate(Price_adj = round(first(Price) * cumprod(Index / lag(Index, default = first(Index))), 2)) %>%
  ungroup() %>%
  select(-g)
# A tibble: 9 x 5
# ID Year Price Index Price_adj
# <fctr> <int> <dbl> <dbl> <dbl>
#1 A_index1 2001 NA 1.5 NA
#2 A_index1 2002 NA 1.5 NA
#3 A_index1 2003 10 2.0 10.00
#4 A_index1 2004 NA 1.0 5.00
#5 A_index1 2005 NA 2.0 10.00
#6 A_index1 2006 15 2.0 15.00
#7 A_index1 2007 NA 3.0 22.50
#8 A_index1 2008 13 1.5 13.00
#9 A_index1 2009 NA 2.0 17.33
Group the data frame by ID and cumsum(!is.na(Price)); the latter splits the data frame into chunks, and each chunk starts with a non-NA Price. Then first(Price) * cumprod(Index / lag(Index, default = first(Index))) does the iterative calculation: it is equivalent to the formula given in the question if you keep substituting Price_adj[i-1] with the expression for Price_adj[i-2], and so on, until you reach Price_adj[1], i.e. first(Price); the intermediate index ratios telescope into a cumulative product.
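As a quick illustration of the telescoping, take the chunk that starts at the 2003 price of 10, whose indices are 2.0, 1.0, 2.0 in the sample data; the cumulative product of consecutive index ratios reproduces rows 3-5 of the loop output:
idx <- c(2.0, 1.0, 2.0)
10 * cumprod(idx / dplyr::lag(idx, default = idx[1]))
# [1] 10  5 10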
Caveat: may not be very efficient if you have many NA chunks.
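Since the question already builds df with data.table, here is a minimal sketch of the same chunked-cumprod idea in data.table syntax (same logic, with shift() playing the role of lag(); untested beyond the toy data):
library(data.table)
dt <- as.data.table(df)
# g marks chunks that each start at a non-NA Price
dt[, g := cumsum(!is.na(Price)), by = ID]
dt[, Price_adj := round(first(Price) * cumprod(Index / shift(Index, fill = first(Index))), 2),
   by = .(ID, g)]
dt[, g := NULL]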
If speed is the primary concern, you can write the function with the Rcpp package:
library(Rcpp)
cppFunction("
NumericVector price_adj(NumericVector price, NumericVector index) {
int n = price.size();
NumericVector adjusted_price(n);
adjusted_price[0] = price[0];
for (int i = 1; i < n; i++) {
if(NumericVector::is_na(price[i])) {
adjusted_price[i] = adjusted_price[i-1] * index[i] / index[i-1];
} else {
adjusted_price[i] = price[i];
}
}
return adjusted_price;
}")
Now use the cpp function with dplyr as follows:
cpp_fun <- function() df %>% group_by(ID) %>% mutate(Price_adj = round(price_adj(Price, Index), 2))
cpp_fun()
# A tibble: 9 x 5
# Groups: ID [1]
# ID Year Price Index Price_adj
# <fctr> <int> <dbl> <dbl> <dbl>
#1 A_index1 2001 NA 1.5 NA
#2 A_index1 2002 NA 1.5 NA
#3 A_index1 2003 10 2.0 10.00
#4 A_index1 2004 NA 1.0 5.00
#5 A_index1 2005 NA 2.0 10.00
#6 A_index1 2006 15 2.0 15.00
#7 A_index1 2007 NA 3.0 22.50
#8 A_index1 2008 13 1.5 13.00
#9 A_index1 2009 NA 2.0 17.33
Benchmark:
Define r_fun as:
r_fun <- function() df %>% group_by(ID, g = cumsum(!is.na(Price))) %>% mutate(Price_adj = round(first(Price) * cumprod(Index / lag(Index, default=first(Index))), 2)) %>% ungroup() %>% select(-g)
On the small sample data, there's already a difference:
microbenchmark::microbenchmark(r_fun(), cpp_fun())
#Unit: milliseconds
# expr min lq mean median uq max neval
# r_fun() 10.127839 10.500281 12.627831 11.148093 12.686662 101.466975 100
# cpp_fun() 3.191278 3.308758 3.738809 3.491495 3.937006 6.627019 100
Testing on a slightly larger data frame:
df <- bind_rows(rep(list(df), 10000))
#dim(df)
#[1] 90000 4
microbenchmark::microbenchmark(r_fun(), cpp_fun(), times = 10)
#Unit: milliseconds
# expr min lq mean median uq max neval
# r_fun() 842.706134 890.978575 904.70863 908.77042 921.89828 986.44576 10
# cpp_fun() 8.722794 8.888667 10.67781 10.86399 12.10647 13.68302 10
Identity test:
identical(ungroup(r_fun()), ungroup(cpp_fun()))
# [1] TRUE
Upvotes: 3