mimibao1009
mimibao1009

Reputation: 89

Fast way to count days from a list between multiple date ranges in R?

I searched for a while but seems most of the days counting methods are for two dates only(i.e., single date range). I don't know if there's a way faster than using a for-loop to calculate days between multiple date ranges.

Here's my data DF:

ItemName  StartDate    EndDate
Apple     2014-01-02   2018-01-01
Orange    2014-08-06   2017-02-03
Banana    2015-03-01   2019-12-31
...

Now I have a reporting date range: 2015-01-01 to 2019-12-31. I want to count days of those items covered by the reporting dates by quarter.

For example, the first quarter of the reporting dates are between 2015-01-01 & 2015-03-31 (including both end)

ItemName  StartDate    EndDate      CountDays  SampleQuarter
Apple     2014-01-02   2018-01-02   90         2015-Q1
Orange    2014-08-06   2017-02-03   90         2015-Q1
Banana    2015-03-01   2019-12-31   31         2015-Q1
...
Apple     2014-01-02   2018-01-02    2         2018-Q1
Orange    2014-08-06   2017-02-03    0         2018-Q1
Banana    2015-03-01   2019-12-31   31         2018-Q1
....

My goal is to use the above result to get this table:

SampleQuarter  TotalDays
2015-Q1        211
2015-Q2        273
...
2018-Q1         33
2018-Q2         91
... 

I use a for loop to handle it. First I created a reference date list:

Date = c("2015-01-01","2015-04-01","2015-06-01",....)

TotalCount = NULL
for(i in 1:length(Date)){
   START = as.Date(Date[i])
   END = START %m+% months(3) -1 ## the end of each quarter 
   
   ## find items that start and end date range is covered by the selected quarter
   ## than calculated the days covered by the selected quarter only  
   df = DF %>% filter( StartDate <=END & EndDate >=END )   %>%
        mutate(StartDateNEW = ifelse(StartDate <= START,START, StartDate ))%>%
        mutate(EndDateNEW = ifelse(EndDate <= END,EndDate , END)) %>% 
        mutate(CountDays= EndDateNEW -StartDateNEW +1) %>% 
        select(-StartDateNEW, EndDateNEW ) %>% 
        mutate(SampleQuarter =paste0( format(START , "%Y") ,"-Q",quarter(START) )  %>%
        group_by(SampleQuarter) %>% summarise(TotalDays = sum(CountDays)) 
   
   TotalCount = rbind(TotalCount ,df)
   START = START %m+% months(3) ## the beginning of next quarter 
}

For a small sample of data, the for-loop works well. When the sample size is huge, the code will take a while. I wonder if there's a method that I can bypass the for-loop and make the whole process faster?

Thanks in advance.

Upvotes: 1

Views: 144

Answers (1)

akrun
akrun

Reputation: 887501

Perhaps we can use

library(dplyr)
library(purrr)
library(tidyr)
library(zoo)
DF %>% 
  mutate(across(ends_with("Date"), as.Date),
         new = map2(StartDate, EndDate,
        ~ tibble(days = seq(.x, .y, by = 'day'), 
      Quarter = cut(days, breaks = seq(.x, .y, by = 'quarter'))))) %>% 
   unnest(c(new)) %>% 
   group_by(ItemName, StartDate, EndDate, Quarter = as.yearqtr(Quarter)) %>%
   summarise(n = n(), .groups = 'drop')

Upvotes: 2

Related Questions