Jacek Kotowski
Jacek Kotowski

Reputation: 704

Dplyr/Lubridate: How to summarise overlapping intervals after grouping

I would like to group agreements and then compare how much their periods overlap (or are apart).

My dataframe may look like:

library(tidyverse)
library(lubridate)

tribble(
~ShipTo,    ~Code,  ~Start, ~End,
"xxxx", "AAA11",    2018-01-01, 2018-03-01,
"yyyy", "BBB23",    2018-02-01, 2018-05-11,
"yyyy", "BBB23",    2018-03-01, 2018-06-11,
"cccc", "AAA11",    2018-01-06, 2018-03-12,
"yyyy", "CCC04",    2018-01-16, 2018-03-31,
"xxxx", "DDD",    2018-01-21,   2018-03-25
)

I would like to mutate a column to create lubridate periods and evaluate them after grouping by ShipTo and Code. What I tried was:

dft3<-dft %>% filter(concat1 %in% to_filter2)  %>%
  arrange(ShipTo,Code)%>% 
  group_by(ShipTo,Code)%>%
  mutate(period=interval(Start,End), 
         nextperiod=interval(lead(Start),lead(End)),
         interv=day(as.period(intersect(period, nextperiod), "days"))) %>%
  group_by(ShipTo,Code)%>%
  summarise(count=n(),
    intervmax=max(interv),
    intervmin=min(interv)) 

If I remove the line group_by(ShipTo,Code)%>% the intervals are created correctly and also the lead intervals are correctly calculated from the next line. But when I naively use group_by, the intervals are not calculated correctly.

I suspect that perhaps my database should be split into many tables by groups and then, after the operation of creating and comparing intervals it should be glued back together.

Is there a succinct way to do it? Or perhaps there is a simpler way I have not yet learned? Thank you in advance for the hint in the right direction.

EDIT: The desired output should be a column with value of overlaps of intervals in days (or distances between intervals if no overlap). Grouping destroys the calculation. I would like to have these values calculated within groups (not accross them).

EDIT2: I trying to solve the problem by splitting dataframe into a list of dataframes and then combining it, but I am not sure of a syntax. It does not quite work, produces tables with one column, a help I was given on other portal (perhaps it can ilustrate the issue). The idea is to split a database, create new columns and combine the tables to a single table.

    fnOverlaps <- function(x) {

      mutate(x,okres=interval(Start,End),
             nastokres=interval(lead(Start),lead(End)), 
             interv=day(as.period(intersect(okres, nastokres), "days"))) 
    }

dft3<-dft3 %>% 
  split(list(.$ShipTo, .$Code), drop = TRUE)  %>%   
  map_df(fnOverlaps) %>% 
  flatten_dfr()

The result (for one group) that I expect would look like this.

tribble(
~ShipTo,    ~Code,  ~interv,    
"yyyy", "BBB23",    70        #say there is a 70 days overlap
"yyyy", "BBB23",    NA        #there is no next row to compare

)

Upvotes: 2

Views: 2002

Answers (2)

Jacek Kotowski
Jacek Kotowski

Reputation: 704

I am putting it just for the record. I received an answer from Jake Knaupp on slack r4ds group with the modern map_df() syntax, it calculates overlap of periods but it converts periods to numeric. And there is a bunch of warnings it will do that.

myFun <- function(x) {

  mutate(x,period=interval(Start,End),
       nextperiod=interval(lead(Start),lead(End)), 
       interv=day(as.period(intersect(period, nextperiod), "days"))) 
  }

df %>% 
  split(list(.$ShipTo, .$Code), drop = TRUE) %>% 
  map_df(myFun)

Upvotes: 0

Mark Peterson
Mark Peterson

Reputation: 9560

It looks like the issue is being caused by trying to combine vectors with the class "Interval." Specifically, they appear to be getting converted to numeric and losing their inherent information.

I think the only viable solution is to split the data.frame, run the analysis on each component separately with lapply, then bring them back together with bind_rows. The number of groups with only one entry present an issue as max and min return -Inf and Inf when the argument is empty after removing NAs. But, that is easy enough to correct for.

This code should work. Note that I am using group_by to ensure the ShipTo/Code columns are kept, though you could do that in other ways.

dft %>%
  split(paste(.$ShipTo, "XXX", .$Code)) %>%
  lapply(function(x){
    x %>%
      arrange(ShipTo,Code) %>% 
      mutate(period=interval(Start,End)
             , nextperiod=interval(lead(Start),lead(End))
             , interv=day(as.period(intersect(period, nextperiod), "days"))
      ) %>%
      group_by(ShipTo,Code)%>%
      summarise(count=n(),
                intervmax=max(interv, na.rm = TRUE),
                intervmin=min(interv, na.rm = TRUE)) %>%
      ungroup()
  }) %>%
  bind_rows() %>%
  mutate(intervmax = ifelse(is.infinite(intervmax)
                            , NA, intervmax)
         , intervmin = ifelse(is.infinite(intervmin)
                              , NA, intervmin))

Returns

# A tibble: 5 x 5
  ShipTo Code  count intervmax intervmin
  <chr>  <chr> <int>     <dbl>     <dbl>
1 cccc   AAA11     1      NA        NA  
2 xxxx   AAA11     1      NA        NA  
3 xxxx   DDD       1      NA        NA  
4 yyyy   BBB23     2      71.0      71.0
5 yyyy   CCC04     1      NA        NA 

Upvotes: 1

Related Questions