kstew
kstew

Reputation: 1114

R function for creating uneven groups based on uneven dates

I am trying to find an R function that can index groups iteratively, given a set of unevenly spaced dates, uneven group sizes, and by grouped cases. Here are example data:

> h
# A tibble: 20 x 2
      ID date      
   <int> <date>    
 1     1 2021-01-07
 2     1 2021-01-11
 3     1 2021-01-15
 4     1 2021-01-16
 5     1 2021-01-21
 6     1 2021-01-26
 7     1 2021-02-04
 8     1 2021-02-08
 9     1 2021-02-13
10     1 2021-02-20
11     1 2021-02-23
12     1 2021-02-27
13     2 2021-01-05
14     2 2021-01-11
15     2 2021-02-02
16     2 2021-02-08
17     2 2021-02-08
18     2 2021-02-14
19     2 2021-02-17
20     2 2021-02-21

For each unique ID, I want to find the first date (chronologically) and create a group (i.e., group==1) for that case and any other rows within 7 days. For the next date after 7 days, create a second group (i.e., group==2) for that case and any others within the next 7 days. Note: the next date is not necessarily exactly 7 days after the initial date. Repeat this process for the remaining remaining cases to get the desired output:

# A tibble: 20 x 3
      ID date       group
   <int> <date>     <dbl>
 1     1 2021-01-07     1
 2     1 2021-01-11     1
 3     1 2021-01-15     2
 4     1 2021-01-16     2
 5     1 2021-01-21     2
 6     1 2021-01-26     3
 7     1 2021-02-04     4
 8     1 2021-02-08     4
 9     1 2021-02-13     5
10     1 2021-02-20     5
11     1 2021-02-23     6
12     1 2021-02-27     6
13     2 2021-01-05     1
14     2 2021-01-11     1
15     2 2021-02-02     2
16     2 2021-02-08     2
17     2 2021-02-08     2
18     2 2021-02-14     3
19     2 2021-02-17     3
20     2 2021-02-21     3

Using a rolling window function of 7 days will not work, as far as I can tell, as it will group the cases incorrectly. But I am wondering if a sort of custom rolling window function could be used? I would prefer a solution using dplyr, but other options would also work. Any help here is appreciated.

> dput(h)
structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), date = structure(c(18634, 
18638, 18642, 18643, 18648, 18653, 18662, 18666, 18671, 18678, 
18681, 18685, 18632, 18638, 18660, 18666, 18666, 18672, 18675, 
18679), class = "Date")), row.names = c(NA, -20L), class = c("tbl_df", 
"tbl", "data.frame"))

Upvotes: 2

Views: 206

Answers (2)

G. Grothendieck
G. Grothendieck

Reputation: 269556

Define a function date1 which given the first date of the group of the prior row's point and the current row's date returns the date of the start of the current group -- that must be one of the two arguments. Then grouping by ID use Reduce to apply that to the dates in each ID and convert the result to factor and then to integer.

library(dplyr)

date1 <- function(prev, x) if (x > prev + 7) x else prev
h %>% 
  group_by(ID) %>%
  mutate(group = as.integer(factor(Reduce(date1, date, acc = TRUE)))) %>%
  ungroup

giving:

# A tibble: 20 x 3
      ID date       group
   <int> <date>     <dbl>
 1     1 2021-01-07     1
 2     1 2021-01-11     1
 3     1 2021-01-15     2
 4     1 2021-01-16     2
 5     1 2021-01-21     2
 6     1 2021-01-26     3
 7     1 2021-02-04     4
 8     1 2021-02-08     4
 9     1 2021-02-13     5
10     1 2021-02-20     5
11     1 2021-02-23     6
12     1 2021-02-27     6
13     2 2021-01-05     1
14     2 2021-01-11     1
15     2 2021-02-02     2
16     2 2021-02-08     2
17     2 2021-02-08     2
18     2 2021-02-14     3
19     2 2021-02-17     3
20     2 2021-02-21     3

Upvotes: 3

IceCreamToucan
IceCreamToucan

Reputation: 28685

For each ID group, create group as a vector of NAs. While some group elements are still NA, take the first date value where group is NA and add 0 and 7 days to it to make a range of dates. For any rows where date is in the calculated date range, set elements of group to 1 more than the current max value of group (or 0 if group is still all NA).

library(data.table)
setDT(df)

df[order(ID, date), {
     group <- rep(NA_real_, .N)
     while(any(is.na(group))){
       group_range <- first(date[is.na(group)]) + c(0, 7)
       group[date %between% group_range] <- 1 + max(fcoalesce(group, 0)) 
     }
     list(date, group) 
   }, by = ID]

# ID       date group
# 1:  1 2021-01-07     1
# 2:  1 2021-01-11     1
# 3:  1 2021-01-15     2
# 4:  1 2021-01-16     2
# 5:  1 2021-01-21     2
# 6:  1 2021-01-26     3
# 7:  1 2021-02-04     4
# 8:  1 2021-02-08     4
# 9:  1 2021-02-13     5
# 10:  1 2021-02-20     5
# 11:  1 2021-02-23     6
# 12:  1 2021-02-27     6
# 13:  2 2021-01-05     1
# 14:  2 2021-01-11     1
# 15:  2 2021-02-02     2
# 16:  2 2021-02-08     2
# 17:  2 2021-02-08     2
# 18:  2 2021-02-14     3
# 19:  2 2021-02-17     3
# 20:  2 2021-02-21     3

Here's another version where I try to limit the computations. No idea if it's actually faster

df[order(ID, date), {
     group <- rep(NA_integer_, .N)
     i <- 1L
     g <- 1L
     while(i <= .N){
       group_range <- date[i] + c(0, 7)
       chg <- date %between% group_range
       group[chg] <- g
       g <- g + 1L
       i <- i + sum(chg)
     }
     list(date, group) 
   }, by = ID]

Upvotes: 0

Related Questions