Riley Stephen
Riley Stephen

Reputation: 35

r min max dates by id and multiple status changes within ID

I have an animal tracking dataset which is as shown below

 Id         Start       Stop          Status
 78122      10/12/1919  10/12/1919    Birth
 78122      1/18/1966   2/2/1972      In
 78122      2/3/1972    9/8/1972      In
 78122      9/9/1972    1/23/1974     In
 78122      1/24/1974   10/22/1975    Out
 78122      10/23/1975  5/4/1979      Out
 78122      5/5/1979    8/29/1980     Out
 78122      8/30/1980   5/14/1988     Out
 78122      5/15/1988   6/18/1988     In
 78122      6/19/1988   1/12/1989     In
 78122      1/13/1989   2/23/1990     In
 78122      2/24/1990   6/15/1991     Out
 78122      6/16/1991   2/11/1993     Out
 78122      2/12/1993   5/3/1994      Out
 78122      5/4/1994    7/27/1994     In
 78122      7/22/1994   1/25/1996     Out
 78122      1/26/1996   11/13/2001    In
 78122      11/14/2001  11/19/2001    In
 78122      11/20/2001  9/1/2009      In
 78122      9/26/2009   9/26/2009     Death

This animal was born in 1919 but moved in and out of its native territory multiple times. What I want to create is a dataset like this. I like to summarize the min(Start) and max(Stop) dates by Status.

For example: there are three rows indicating the animal was in the territory between 1/18/1966 to 1/23/1974.

 Id         Start       Stop          Status
 78122      1/18/1966   2/2/1972      In
 78122      2/3/1972    9/8/1972      In
 78122      9/9/1972    1/23/1974     In

This information should be summarized into 1 single row with min(Start) and max(Stop) like this

 Id         MinStart    MaxStop       Status
 78122      1/18/1966   1/23/1974     In

Again there are four rows indicating the animal was out of territory between 1/24/1974 to 5/14/1988.

 Id         Start       Stop          Status
 78122      1/24/1974   10/22/1975    Out
 78122      10/23/1975  5/4/1979      Out
 78122      5/5/1979    8/29/1980     Out
 78122      8/30/1980   5/14/1988     Out

This information should be summarized into 1 single row with min(Start) and max(Stop) like this

 Id         MinStart    MaxStop       Status
 78122      1/24/1974   5/14/1988     Out

Similarly for other In and Out status. The final dataset should look like this below.

 Id         MinStart    MaxStop       Status
 78122      10/12/1919  10/12/1919    Birth
 78122      1/18/1966   1/23/1974     In
 78122      1/24/1974   5/14/1988     Out
 78122      5/15/1988   2/23/1990     In
 78122      2/24/1990   5/3/1994      Out
 78122      5/4/1994    7/27/1994     In
 78122      7/28/1994   1/25/1996     Out
 78122      1/26/1996   9/1/2009      In
 78122      9/26/2009   9/26/2009     Death

Any suggestions on how to rearrange this dataset based on the criteria above is much apricated. So far I tried

 test1 <- testcase %>% 
          group_by(ID,Status) %>% 
          summarize(MinStart  = min(Start), MaxStop= max(Stop))

but this doesn't seem to work. It just creates one min and stop date for all the In Status and Out Status together. That is incorrect.

Upvotes: 3

Views: 378

Answers (3)

mharinga
mharinga

Reputation: 1780

You can for example use insurancerating::reduce():

library(insurancerating)
library(dplyr)
library(lubridate)

d %>% 
  mutate(across(c(Start, Stop), lubridate::mdy)) %>%
  insurancerating::reduce(d_date, begin = Start, end = Stop, Id, Status)

      Id Status index      Start       Stop
# 1  78122  Birth     0 1919-10-12 1919-10-12
# 2  78122  Death     0 2009-09-26 2009-09-26
# 3  78122     In     0 1966-01-18 1974-01-23
# 4  78122     In     1 1988-05-15 1990-02-23
# 5  78122     In     2 1994-05-04 1994-07-27
# 6  78122     In     3 1996-01-26 2009-09-01
# 7  78122    Out     0 1974-01-24 1988-05-14
# 8  78122    Out     1 1990-02-24 1994-05-03
# 9  78122    Out     2 1994-07-22 1996-01-25
# 10 78123  Birth     0 1919-10-12 1919-10-12
# 11 78123  Death     0 2009-09-26 2009-09-26
# 12 78123     In     0 1966-01-18 1974-01-23
# 13 78123     In     1 1988-05-15 1990-02-23
# 14 78123     In     2 1994-05-04 1994-07-27
# 15 78123     In     3 1996-01-26 2009-09-01
# 16 78123    Out     0 1974-01-24 1988-05-14
# 17 78123    Out     1 1990-02-24 1994-05-03
# 18 78123    Out     2 1994-07-22 1996-01-25

Note: d is the data as given by @jay.sf

Upvotes: 0

jay.sf
jay.sf

Reputation: 72683

One way to do this is to capture the dates as such while forcing them into numeric values with sapply, to be able to use range later on. Then, within ave we use rle in an mapply to let a variable x grow by 1 every time Status changes. We now may easily aggregate the ranges over Id and x, where a columns subset already gives us the result, which we just need to convert as.Date and cbind the suffix of x to it with gsub.

d[2:3] <- sapply(d[2:3], function(x) as.Date(x, "%m/%d/%Y"))
f <- function(x) {r <- rle(x)$l;unlist(mapply(rep, seq(r), r))}
d <- transform(d, x=paste(Id, ave(Status, Id, FUN=f), Status))
r <- do.call(data.frame, aggregate(cbind(Start, Stop) ~ Id + x, d, FUN=range))[c(1:3, 6)]
r[3:4] <- lapply(r[3:4], as.Date, origin="1970-01-01")
r <- cbind(r[1], setNames(r[3:4], c("MinStart", "MaxStop")), Status=gsub(".*\\s", "", r$x))

Result

r[order(r$Id), ]
#       Id   MinStart    MaxStop Status
# 1  78122 1919-10-12 1919-10-12  Birth
# 2  78122 1966-01-18 1974-01-23     In
# 3  78122 1974-01-24 1988-05-14    Out
# 4  78122 1988-05-15 1990-02-23     In
# 5  78122 1990-02-24 1994-05-03    Out
# 6  78122 1994-05-04 1994-07-27     In
# 7  78122 1994-07-22 1996-01-25    Out
# 8  78122 1996-01-26 2009-09-01     In
# 9  78122 2009-09-26 2009-09-26  Death
# 10 78123 1919-10-12 1919-10-12  Birth
# 11 78123 1966-01-18 1974-01-23     In
# 12 78123 1974-01-24 1988-05-14    Out
# 13 78123 1988-05-15 1990-02-23     In
# 14 78123 1990-02-24 1994-05-03    Out
# 15 78123 1994-05-04 1994-07-27     In
# 16 78123 1994-07-22 1996-01-25    Out
# 17 78123 1996-01-26 2009-09-01     In
# 18 78123 2009-09-26 2009-09-26  Death

Data:

Note: Data frame is doubled and Id is incremented by one for demonstration purposes.

d <- structure(list(Id = c(78122L, 78122L, 78122L, 78122L, 78122L, 
78122L, 78122L, 78122L, 78122L, 78122L, 78122L, 78122L, 78122L, 
78122L, 78122L, 78122L, 78122L, 78122L, 78122L, 78122L, 78123L, 
78123L, 78123L, 78123L, 78123L, 78123L, 78123L, 78123L, 78123L, 
78123L, 78123L, 78123L, 78123L, 78123L, 78123L, 78123L, 78123L, 
78123L, 78123L, 78123L), Start = c("10/12/1919", "1/18/1966", 
"2/3/1972", "9/9/1972", "1/24/1974", "10/23/1975", "5/5/1979", 
"8/30/1980", "5/15/1988", "6/19/1988", "1/13/1989", "2/24/1990", 
"6/16/1991", "2/12/1993", "5/4/1994", "7/22/1994", "1/26/1996", 
"11/14/2001", "11/20/2001", "9/26/2009", "10/12/1919", "1/18/1966", 
"2/3/1972", "9/9/1972", "1/24/1974", "10/23/1975", "5/5/1979", 
"8/30/1980", "5/15/1988", "6/19/1988", "1/13/1989", "2/24/1990", 
"6/16/1991", "2/12/1993", "5/4/1994", "7/22/1994", "1/26/1996", 
"11/14/2001", "11/20/2001", "9/26/2009"), Stop = c("10/12/1919", 
"2/2/1972", "9/8/1972", "1/23/1974", "10/22/1975", "5/4/1979", 
"8/29/1980", "5/14/1988", "6/18/1988", "1/12/1989", "2/23/1990", 
"6/15/1991", "2/11/1993", "5/3/1994", "7/27/1994", "1/25/1996", 
"11/13/2001", "11/19/2001", "9/1/2009", "9/26/2009", "10/12/1919", 
"2/2/1972", "9/8/1972", "1/23/1974", "10/22/1975", "5/4/1979", 
"8/29/1980", "5/14/1988", "6/18/1988", "1/12/1989", "2/23/1990", 
"6/15/1991", "2/11/1993", "5/3/1994", "7/27/1994", "1/25/1996", 
"11/13/2001", "11/19/2001", "9/1/2009", "9/26/2009"), Status = c("Birth", 
"In", "In", "In", "Out", "Out", "Out", "Out", "In", "In", "In", 
"Out", "Out", "Out", "In", "Out", "In", "In", "In", "Death", 
"Birth", "In", "In", "In", "Out", "Out", "Out", "Out", "In", 
"In", "In", "Out", "Out", "Out", "In", "Out", "In", "In", "In", 
"Death")), class = "data.frame", row.names = c(NA, -40L))

Upvotes: 0

Ian Campbell
Ian Campbell

Reputation: 24790

You need some run length encoding. I'm going to use data.table::rleid for convenience, but you can use the base version if you want:

library(data.table)
testcase %>% 
  group_by(Id, RLE = rleid(Status)) %>%
  arrange(Start) %>%
  dplyr::summarise(Start = min(Start), Stop = max(Stop), Status = first(Status))
# A tibble: 9 x 5
# Groups:   Id [1]
     Id   RLE Start      Stop       Status
  <int> <int> <date>     <date>     <chr> 
1 78122     1 1919-10-12 1919-10-12 Birth 
2 78122     2 1966-01-18 1974-01-23 In    
3 78122     3 1974-01-24 1988-05-14 Out   
4 78122     4 1988-05-15 1990-02-23 In    
5 78122     5 1990-02-24 1994-05-03 Out   
6 78122     6 1994-05-04 1994-07-27 In    
7 78122     7 1994-07-22 1996-01-25 Out   
8 78122     8 1996-01-26 2009-09-01 In    
9 78122     9 2009-09-26 2009-09-26 Death 

Note that I converted your dates into class date, which I will leave to you. Otherwise they don't sort correctly.

And here is the group_by call without data.table

...
  group_by(Id, RLE = with(rle(Status), rep(seq_along(lengths), lengths))) %>%
...

Upvotes: 1

Related Questions