Reputation: 542

How do i compare all values against the next ones using group_by in R

I have a data frame with dates that i would like to compare using group_by, but i need to compare them with ALL the other dates inside their group to generate a route without gaps or overlapping, so i can get only the maximum end date and the minimum beginning date for each ID, a reproducible example:

ID <- c(1,1,1,3,3,7,7,7,22,22,32,32,173,173,213,213,230,330,330,330,330,150579)
EndDate <- c("9999-12-31","2018-04-30","2015-07-31","9999-12-31","2008-07-26","9999-12-31","9999-12-31","2011-08-31","9999-12-31","2006-11-30","9999-12-31","2007-06-30","9999-12-31","2010-09-30","9999-12-31","2013-04-30","9999-12-31","9999-12-31","2016-12-31","2016-09-30","2015-08-31","9999-12-31")
BegDate <- c("2015-08-01","2017-10-23","1983-12-05","2015-11-12","2003-02-24","2017-04-01","2014-07-15","1991-11-04","2006-12-01","1979-08-01","2007-07-01","1979-08-01","2010-10-01","1987-04-01","1980-10-20","2008-05-01","1983-02-14","1982-01-01","2016-10-01","2015-09-01","2014-02-01","1982-09-01")

df_dates <- data.frame(ID,EndDate,BegDate)

ID       EndDate       BegDate
1        9999-12-31    2015-08-01
1        2018-04-30    2017-10-23
1        2015-07-31    1983-12-05
3        9999-12-31    2015-11-12
3        9999-12-31    2015-11-12
7        9999-12-31    2017-04-01
7        9999-12-31    2014-07-15
7        2011-08-31    1991-11-04
22       9999-12-31    2006-12-01
22       2006-11-30    1979-08-01
32       9999-12-31    2007-07-01
32       2007-06-30    1979-08-01
173      9999-12-31    2010-10-01
173      2010-09-30    1987-04-01
213      9999-12-31    1980-10-20
213      2013-04-30    2008-05-01
233      9999-12-31    2016-06-01
233      2016-05-31    1998-10-01
330      9999-12-31    1982-01-01
330      2016-12-31    2016-10-01
330      2016-09-30    2015-09-01
330      2015-08-31    2014-02-01
150579   9999-12-31    1982-09-01

I have tried dplyr, but don't know how to make the comparison between ALL the elements of the group. I've used a for loop, but the data frame is massive and speed is a must.

v_result <- c()

for(i in unique(df_dates$ID)){
   df_temp <- df_dates[df_dates$ID == i,]

   df_temp$EndDate <- as.Date(df_temp$EndDate,"%Y%m%d")
   df_temp$BegDate <- as.Date(df_temp$BegDate,"%Y%m%d")

   v_row <- (1:nrow(df_temp))

   for (j in v_row){
      h = j + 1
      elm <- v_row[!v_row %in% j]
      findNext <- FALSE

      for(h in elm){
        if((df_temp$EndDate[j] >= df_temp$EndDate[h] AND 
            df_temp$BegDate[j] <= df_temp$BegDate[h]) |
            df_temp$BegDate[j] - days(1) == df_temp$EndDate[h]){
              findNext <- TRUE
         }
      }
   v_result <- c(v_result,findNext) 
   } 
}

As you can see, many for loops and i don't understand the apply family of functions very well, plus, there are probably over 150k ID's so not a viable option. My idea was to flag as false the overlapped ones and the ones that represented a gap and filter those, allowing me to take the maximum and minimum

df_final <- df_final%>%
   group_by(ID)%>%
   mutate(
    Biggest = max(EndDate),
    Lowest = min(BegDate)
   )

generating something like this:

ID       EndDate       BegDate
1        9999-12-31    1983-12-05
3        9999-12-31    2015-11-12
7        9999-12-31    2014-07-15
22       9999-12-31    1979-08-01
32       9999-12-31    1979-08-01
173      9999-12-31    2017-07-01
213      9999-12-31    1980-10-20
233      9999-12-31    1998-10-01
330      9999-12-31    1982-01-01
150579   9999-12-31    1982-09-01

The End date could not always be 9999-12-31, just as long as it's the biggest date by ID that corresponds to a period without gaps and ignoring overlaps. I've been struggling with this for a few days now and can't make any progress.

Is there a way to do this with dplyr that is efficient for large dataframes?

Upvotes: 0

Answers (3)

arg0naut91

Reputation: 14764

The logic behind your final output is not entirely clear. For instance, let's do something with data.table (should be efficient for larger dataframes) and magrittr (for better readability):

library(data.table)
library(magrittr)

calc_cummax <- function(x) (setattr(cummax(unclass(x)), "class", c("Date", "IDate")))

df_final <- setDT(df_dates) %>%
  .[, `:=` (BegDate = as.Date(as.character(BegDate), "%Y-%m-%d"),
            EndDate = as.Date(as.character(EndDate), "%Y-%m-%d"))] %>%
  .[order(ID, BegDate),] %>%
  .[, max_until_now := shift(calc_cummax(EndDate)), by = ID] %>%
  .[, lead_max := shift(max_until_now, type = "lead"), by = ID] %>%
  .[is.na(max_until_now), max_until_now := lead_max, by = ID] %>%
  .[(max_until_now + 1L) >= BegDate, gap_between := 0, by = ID] %>% 
  .[(max_until_now + 1L) < BegDate, gap_between := 1, by = ID] %>%
  .[is.na(gap_between), gap_between := 0] %>% 
  .[, ("fakeidx") := cumsum(gap_between), by = ID] %>%
  .[, .(BegDate = min(BegDate), EndDate = max(EndDate)), by = .(ID, fakeidx)] %>% 
  #.[, .SD[.N], by = ID] %>%
  .[, ("fakeidx") := NULL]

The output here is:

        ID    BegDate    EndDate
 1:      1 1983-12-05 9999-12-31
 2:      3 2003-02-24 2008-07-26
 3:      3 2015-11-12 9999-12-31
 4:      7 1991-11-04 2011-08-31
 5:      7 2014-07-15 9999-12-31
 6:     22 1979-08-01 9999-12-31
 7:     32 1979-08-01 9999-12-31
 8:    173 1987-04-01 9999-12-31
 9:    213 1980-10-20 9999-12-31
10:    230 1983-02-14 9999-12-31
11:    330 1982-01-01 9999-12-31
12: 150579 1982-09-01 9999-12-31

If you take a look at the 2nd and 4th row, you will see that according to you they should not be there.

However there is a gap in-between, so we cannot just take the lowest BegDate, and we need to arrive to this step in order to produce your final output.

For your final output, the assumption could then be that you want to get rid of anything before the gap occurred (i.e. to take only the last record per group). You can do this by simply uncommenting the line before the last one, i.e.:

library(data.table)
library(magrittr)

calc_cummax <- function(x) (setattr(cummax(unclass(x)), "class", c("Date", "IDate")))

df_final <- setDT(df_dates) %>%
  .[, `:=` (BegDate = as.Date(as.character(BegDate), "%Y-%m-%d"),
            EndDate = as.Date(as.character(EndDate), "%Y-%m-%d"))] %>%
  .[order(ID, BegDate),] %>%
  .[, max_until_now := shift(calc_cummax(EndDate)), by = ID] %>%
  .[, lead_max := shift(max_until_now, type = "lead"), by = ID] %>%
  .[is.na(max_until_now), max_until_now := lead_max, by = ID] %>%
  .[(max_until_now + 1L) >= BegDate, gap_between := 0, by = ID] %>% 
  .[(max_until_now + 1L) < BegDate, gap_between := 1, by = ID] %>%
  .[is.na(gap_between), gap_between := 0] %>% 
  .[, ("fakeidx") := cumsum(gap_between), by = ID] %>%
  .[, .(BegDate = min(BegDate), EndDate = max(EndDate)), by = .(ID, fakeidx)] %>% 
  .[, .SD[.N], by = ID] %>%
  .[, ("fakeidx") := NULL]

Producing:

        ID    BegDate    EndDate
 1:      1 1983-12-05 9999-12-31
 2:      3 2015-11-12 9999-12-31
 3:      7 2014-07-15 9999-12-31
 4:     22 1979-08-01 9999-12-31
 5:     32 1979-08-01 9999-12-31
 6:    173 1987-04-01 9999-12-31
 7:    213 1980-10-20 9999-12-31
 8:    230 1983-02-14 9999-12-31
 9:    330 1982-01-01 9999-12-31
10: 150579 1982-09-01 9999-12-31

Upvotes: 1

JineshEP

Reputation: 748

With no filter:

 df_dates %>% unique(by="ID") %>% mutate(EndDate=ymd(EndDate), BegDate=ymd(BegDate)) %>% group_by(ID) %>% summarize(max(EndDate), min(BegDate))

With filter before group_by(as in the for loop). Note the use of lead function to compare one date with the next row's date.

df_dates %>% unique(by="ID") %>% mutate(EndDate=ymd(EndDate), BegDate=ymd(BegDate)) %>% filter(EndDate >= lead(EndDate) & BegDate <= lead(BegDate) | BegDate-1 == lead(EndDate) ) %>% group_by(ID) %>% summarize(max(EndDate), min(BegDate))

With filter after group_by(as the intention is not very clear from your example)

df_dates %>% unique(by="ID") %>% mutate(EndDate=ymd(EndDate), BegDate=ymd(BegDate)) %>%  group_by(ID) %>% filter(EndDate >= lead(EndDate) & BegDate <= lead(BegDate) | BegDate-1 == lead(EndDate) )  %>% summarize(max(EndDate), min(BegDate))

Upvotes: 0

Andy Baxter

Reputation: 7626

Starting with your code for creating the dataframe df_dates above, the following code would produce the table at the bottom:

df_dates <- data.frame(ID,EndDate,BegDate)

df_dates %>%
  mutate(EndDate=as.Date(EndDate, "%Y-%m-%d"),      #Your as.Date calls above didn't include
         BegDate=as.Date(BegDate, "%Y-%m-%d")) %>%  #the '-' character between values
  group_by(ID) %>%
  summarise(                                        #using 'summarise' produces one row per
    Biggest = max(EndDate),                         #grouped 'ID'. 'mutate' keeps all rows.
    Lowest = min(BegDate)
  )

Hope that helps in how dplyr would give you expected results?

Upvotes: 0

How do i compare all values against the next ones using group_by in R

Answers (3)

Related Questions