Reputation: 475
I have the following problem I am trying to solve. Here some example data:
library(tidyverse)
library(lubridate)
date <- data.frame(date=seq(ymd('2018-01-01'),ymd('2018-02-28'), by = '1 day'))
group <- data.frame(group=c("A","B"))
subgroup <- data.frame(subgroup=c("C","D"))
DF <- merge(merge(date,group,by=NULL),subgroup,by=NULL)
DF$group_value <- apply(DF, 1, function(x) sample(8:12,1))
DF$subgroup_value <- apply(DF, 1, function(x) sample(1:5,1))
DF <- DF %>%
arrange(date,group,subgroup)
I now want to calculate the following:
for every given day t, group and subgroup combination calculate the number of days until the (backward) cumsum of subgroup_value is equal or greater than the group value of day t.
I know how to do that by using some for loops and some dplyr functionality, but this is just terrible slow:
for(i in seq(1,nrow(date),1)) {
for(j in seq(1,nrow(group),1)) {
for(k in seq(1,nrow(subgroup),1)) {
tmp <- DF %>%
filter(date<=date[i] & group == group[j] & subgroup == subgroup[k]) %>%
arrange(desc(date))
tmp$helper <- 1
tmp <- tmp %>%
mutate(
cs_helper = cumsum(helper),
cs_subgroup_value = cumsum(subgroup_value),
nr_days = case_when (
cs_subgroup_value >= group_value ~ cs_helper,
TRUE ~ NA_real_)
)
#this is the final result for date[i], group[j], subgroup[k]
value <- min(tmp[,"nr_days"], na.rm=T)
}
}
}
Example
head(DF,10)
date group subgroup group_value subgroup_value result
1 2018-01-01 A C 12 2 NA
2 2018-01-02 A C 11 4 NA
3 2018-01-03 A C 11 4 NA
4 2018-01-04 A C 9 5 2
5 2018-01-05 A C 12 5 3
6 2018-01-06 A C 10 3 3
7 2018-01-07 A C 12 5 3
8 2018-01-08 A C 8 1 3
9 2018-01-09 A C 12 4 4
10 2018-01-10 A C 9 1 4
So for row 10, I need to sum the last 4 values of subgroup to be greater or equal to 9.
I am sure that this code can be highly optimized by using some vectorized version but I am struggle to find a good starting point for that (As you can see from the code above, I am a newbie in R)
My question is: How would you approach this problem in order to vectorize it for speed optimisation?
Thanks! Stephan
Upvotes: 1
Views: 197
Reputation: 191
Here's an attempt, take a copy of each group/subgroups data frame, and cross join to the data. This is then filtered to only find the days before. This allows us for each day to calculate all of the cumulative sums
DF %>%
group_by(group, subgroup) %>%
mutate(day = row_number(), J = TRUE) %>%
nest() %>%
arrange(group, subgroup) %>%
mutate(data = map(data, function(d) {
inner_join(d, transmute(d, x = day, v = subgroup_value, J), by = "J") %>%
filter(day >= x) %>%
mutate(x = day - x + 1) %>%
arrange(day, x) %>%
group_by(date, group_value, date, subgroup_value) %>%
mutate(vv = cumsum(v),
vv = ifelse(vv >= group_value, vv, NA),
xx = ifelse(!is.na(vv), x, NA)) %>%
group_by(date, group_value, day, subgroup_value) %>%
summarise(x = min(xx, na.rm = TRUE), v = min(vv, na.rm = TRUE))
})) %>%
unnest()
Upvotes: 1