Reputation: 542
I have a data frame with dates that i would like to compare using group_by, but i need to compare them with ALL the other dates inside their group to generate a route without gaps or overlapping, so i can get only the maximum end date and the minimum beginning date for each ID, a reproducible example:
ID <- c(1,1,1,3,3,7,7,7,22,22,32,32,173,173,213,213,230,330,330,330,330,150579)
EndDate <- c("9999-12-31","2018-04-30","2015-07-31","9999-12-31","2008-07-26","9999-12-31","9999-12-31","2011-08-31","9999-12-31","2006-11-30","9999-12-31","2007-06-30","9999-12-31","2010-09-30","9999-12-31","2013-04-30","9999-12-31","9999-12-31","2016-12-31","2016-09-30","2015-08-31","9999-12-31")
BegDate <- c("2015-08-01","2017-10-23","1983-12-05","2015-11-12","2003-02-24","2017-04-01","2014-07-15","1991-11-04","2006-12-01","1979-08-01","2007-07-01","1979-08-01","2010-10-01","1987-04-01","1980-10-20","2008-05-01","1983-02-14","1982-01-01","2016-10-01","2015-09-01","2014-02-01","1982-09-01")
df_dates <- data.frame(ID,EndDate,BegDate)
ID EndDate BegDate
1 9999-12-31 2015-08-01
1 2018-04-30 2017-10-23
1 2015-07-31 1983-12-05
3 9999-12-31 2015-11-12
3 9999-12-31 2015-11-12
7 9999-12-31 2017-04-01
7 9999-12-31 2014-07-15
7 2011-08-31 1991-11-04
22 9999-12-31 2006-12-01
22 2006-11-30 1979-08-01
32 9999-12-31 2007-07-01
32 2007-06-30 1979-08-01
173 9999-12-31 2010-10-01
173 2010-09-30 1987-04-01
213 9999-12-31 1980-10-20
213 2013-04-30 2008-05-01
233 9999-12-31 2016-06-01
233 2016-05-31 1998-10-01
330 9999-12-31 1982-01-01
330 2016-12-31 2016-10-01
330 2016-09-30 2015-09-01
330 2015-08-31 2014-02-01
150579 9999-12-31 1982-09-01
I have tried dplyr, but don't know how to make the comparison between ALL the elements of the group. I've used a for loop, but the data frame is massive and speed is a must.
v_result <- c()
for(i in unique(df_dates$ID)){
df_temp <- df_dates[df_dates$ID == i,]
df_temp$EndDate <- as.Date(df_temp$EndDate,"%Y%m%d")
df_temp$BegDate <- as.Date(df_temp$BegDate,"%Y%m%d")
v_row <- (1:nrow(df_temp))
for (j in v_row){
h = j + 1
elm <- v_row[!v_row %in% j]
findNext <- FALSE
for(h in elm){
if((df_temp$EndDate[j] >= df_temp$EndDate[h] AND
df_temp$BegDate[j] <= df_temp$BegDate[h]) |
df_temp$BegDate[j] - days(1) == df_temp$EndDate[h]){
findNext <- TRUE
}
}
v_result <- c(v_result,findNext)
}
}
As you can see, many for loops and i don't understand the apply
family of functions very well, plus, there are probably over 150k ID's so not a viable option. My idea was to flag as false the overlapped ones and the ones that represented a gap and filter those, allowing me to take the maximum and minimum
df_final <- df_final%>%
group_by(ID)%>%
mutate(
Biggest = max(EndDate),
Lowest = min(BegDate)
)
generating something like this:
ID EndDate BegDate
1 9999-12-31 1983-12-05
3 9999-12-31 2015-11-12
7 9999-12-31 2014-07-15
22 9999-12-31 1979-08-01
32 9999-12-31 1979-08-01
173 9999-12-31 2017-07-01
213 9999-12-31 1980-10-20
233 9999-12-31 1998-10-01
330 9999-12-31 1982-01-01
150579 9999-12-31 1982-09-01
The End date could not always be 9999-12-31, just as long as it's the biggest date by ID that corresponds to a period without gaps and ignoring overlaps. I've been struggling with this for a few days now and can't make any progress.
Is there a way to do this with dplyr that is efficient for large dataframes?
Upvotes: 0
Views: 274
Reputation: 14764
The logic behind your final output is not entirely clear. For instance, let's do something with data.table
(should be efficient for larger dataframes) and magrittr
(for better readability):
library(data.table)
library(magrittr)
calc_cummax <- function(x) (setattr(cummax(unclass(x)), "class", c("Date", "IDate")))
df_final <- setDT(df_dates) %>%
.[, `:=` (BegDate = as.Date(as.character(BegDate), "%Y-%m-%d"),
EndDate = as.Date(as.character(EndDate), "%Y-%m-%d"))] %>%
.[order(ID, BegDate),] %>%
.[, max_until_now := shift(calc_cummax(EndDate)), by = ID] %>%
.[, lead_max := shift(max_until_now, type = "lead"), by = ID] %>%
.[is.na(max_until_now), max_until_now := lead_max, by = ID] %>%
.[(max_until_now + 1L) >= BegDate, gap_between := 0, by = ID] %>%
.[(max_until_now + 1L) < BegDate, gap_between := 1, by = ID] %>%
.[is.na(gap_between), gap_between := 0] %>%
.[, ("fakeidx") := cumsum(gap_between), by = ID] %>%
.[, .(BegDate = min(BegDate), EndDate = max(EndDate)), by = .(ID, fakeidx)] %>%
#.[, .SD[.N], by = ID] %>%
.[, ("fakeidx") := NULL]
The output here is:
ID BegDate EndDate
1: 1 1983-12-05 9999-12-31
2: 3 2003-02-24 2008-07-26
3: 3 2015-11-12 9999-12-31
4: 7 1991-11-04 2011-08-31
5: 7 2014-07-15 9999-12-31
6: 22 1979-08-01 9999-12-31
7: 32 1979-08-01 9999-12-31
8: 173 1987-04-01 9999-12-31
9: 213 1980-10-20 9999-12-31
10: 230 1983-02-14 9999-12-31
11: 330 1982-01-01 9999-12-31
12: 150579 1982-09-01 9999-12-31
If you take a look at the 2nd and 4th row, you will see that according to you they should not be there.
However there is a gap in-between, so we cannot just take the lowest BegDate
, and we need to arrive to this step in order to produce your final output.
For your final output, the assumption could then be that you want to get rid of anything before the gap occurred (i.e. to take only the last record per group). You can do this by simply uncommenting the line before the last one, i.e.:
library(data.table)
library(magrittr)
calc_cummax <- function(x) (setattr(cummax(unclass(x)), "class", c("Date", "IDate")))
df_final <- setDT(df_dates) %>%
.[, `:=` (BegDate = as.Date(as.character(BegDate), "%Y-%m-%d"),
EndDate = as.Date(as.character(EndDate), "%Y-%m-%d"))] %>%
.[order(ID, BegDate),] %>%
.[, max_until_now := shift(calc_cummax(EndDate)), by = ID] %>%
.[, lead_max := shift(max_until_now, type = "lead"), by = ID] %>%
.[is.na(max_until_now), max_until_now := lead_max, by = ID] %>%
.[(max_until_now + 1L) >= BegDate, gap_between := 0, by = ID] %>%
.[(max_until_now + 1L) < BegDate, gap_between := 1, by = ID] %>%
.[is.na(gap_between), gap_between := 0] %>%
.[, ("fakeidx") := cumsum(gap_between), by = ID] %>%
.[, .(BegDate = min(BegDate), EndDate = max(EndDate)), by = .(ID, fakeidx)] %>%
.[, .SD[.N], by = ID] %>%
.[, ("fakeidx") := NULL]
Producing:
ID BegDate EndDate
1: 1 1983-12-05 9999-12-31
2: 3 2015-11-12 9999-12-31
3: 7 2014-07-15 9999-12-31
4: 22 1979-08-01 9999-12-31
5: 32 1979-08-01 9999-12-31
6: 173 1987-04-01 9999-12-31
7: 213 1980-10-20 9999-12-31
8: 230 1983-02-14 9999-12-31
9: 330 1982-01-01 9999-12-31
10: 150579 1982-09-01 9999-12-31
Upvotes: 1
Reputation: 748
With no filter:
df_dates %>% unique(by="ID") %>% mutate(EndDate=ymd(EndDate), BegDate=ymd(BegDate)) %>% group_by(ID) %>% summarize(max(EndDate), min(BegDate))
With filter before group_by(as in the for loop). Note the use of lead function to compare one date with the next row's date.
df_dates %>% unique(by="ID") %>% mutate(EndDate=ymd(EndDate), BegDate=ymd(BegDate)) %>% filter(EndDate >= lead(EndDate) & BegDate <= lead(BegDate) | BegDate-1 == lead(EndDate) ) %>% group_by(ID) %>% summarize(max(EndDate), min(BegDate))
With filter after group_by(as the intention is not very clear from your example)
df_dates %>% unique(by="ID") %>% mutate(EndDate=ymd(EndDate), BegDate=ymd(BegDate)) %>% group_by(ID) %>% filter(EndDate >= lead(EndDate) & BegDate <= lead(BegDate) | BegDate-1 == lead(EndDate) ) %>% summarize(max(EndDate), min(BegDate))
Upvotes: 0
Reputation: 7626
Starting with your code for creating the dataframe df_dates
above, the following code would produce the table at the bottom:
df_dates <- data.frame(ID,EndDate,BegDate)
df_dates %>%
mutate(EndDate=as.Date(EndDate, "%Y-%m-%d"), #Your as.Date calls above didn't include
BegDate=as.Date(BegDate, "%Y-%m-%d")) %>% #the '-' character between values
group_by(ID) %>%
summarise( #using 'summarise' produces one row per
Biggest = max(EndDate), #grouped 'ID'. 'mutate' keeps all rows.
Lowest = min(BegDate)
)
Hope that helps in how dplyr
would give you expected results?
Upvotes: 0