Reputation: 35
I have an animal tracking dataset which is as shown below
Id Start Stop Status
78122 10/12/1919 10/12/1919 Birth
78122 1/18/1966 2/2/1972 In
78122 2/3/1972 9/8/1972 In
78122 9/9/1972 1/23/1974 In
78122 1/24/1974 10/22/1975 Out
78122 10/23/1975 5/4/1979 Out
78122 5/5/1979 8/29/1980 Out
78122 8/30/1980 5/14/1988 Out
78122 5/15/1988 6/18/1988 In
78122 6/19/1988 1/12/1989 In
78122 1/13/1989 2/23/1990 In
78122 2/24/1990 6/15/1991 Out
78122 6/16/1991 2/11/1993 Out
78122 2/12/1993 5/3/1994 Out
78122 5/4/1994 7/27/1994 In
78122 7/22/1994 1/25/1996 Out
78122 1/26/1996 11/13/2001 In
78122 11/14/2001 11/19/2001 In
78122 11/20/2001 9/1/2009 In
78122 9/26/2009 9/26/2009 Death
This animal was born in 1919 but moved in and out of its native territory multiple times. What I want to create is a dataset like this. I like to summarize the min(Start)
and max(Stop)
dates by Status.
For example: there are three rows indicating the animal was in the territory between 1/18/1966
to 1/23/1974
.
Id Start Stop Status
78122 1/18/1966 2/2/1972 In
78122 2/3/1972 9/8/1972 In
78122 9/9/1972 1/23/1974 In
This information should be summarized into 1 single row with min(Start)
and max(Stop)
like this
Id MinStart MaxStop Status
78122 1/18/1966 1/23/1974 In
Again there are four rows indicating the animal was out of territory between 1/24/1974
to 5/14/1988
.
Id Start Stop Status
78122 1/24/1974 10/22/1975 Out
78122 10/23/1975 5/4/1979 Out
78122 5/5/1979 8/29/1980 Out
78122 8/30/1980 5/14/1988 Out
This information should be summarized into 1 single row with min(Start)
and max(Stop)
like this
Id MinStart MaxStop Status
78122 1/24/1974 5/14/1988 Out
Similarly for other In and Out status. The final dataset should look like this below.
Id MinStart MaxStop Status
78122 10/12/1919 10/12/1919 Birth
78122 1/18/1966 1/23/1974 In
78122 1/24/1974 5/14/1988 Out
78122 5/15/1988 2/23/1990 In
78122 2/24/1990 5/3/1994 Out
78122 5/4/1994 7/27/1994 In
78122 7/28/1994 1/25/1996 Out
78122 1/26/1996 9/1/2009 In
78122 9/26/2009 9/26/2009 Death
Any suggestions on how to rearrange this dataset based on the criteria above is much apricated. So far I tried
test1 <- testcase %>%
group_by(ID,Status) %>%
summarize(MinStart = min(Start), MaxStop= max(Stop))
but this doesn't seem to work. It just creates one min and stop date for all the In Status and Out Status together. That is incorrect.
Upvotes: 3
Views: 378
Reputation: 1780
You can for example use insurancerating::reduce()
:
library(insurancerating)
library(dplyr)
library(lubridate)
d %>%
mutate(across(c(Start, Stop), lubridate::mdy)) %>%
insurancerating::reduce(d_date, begin = Start, end = Stop, Id, Status)
Id Status index Start Stop
# 1 78122 Birth 0 1919-10-12 1919-10-12
# 2 78122 Death 0 2009-09-26 2009-09-26
# 3 78122 In 0 1966-01-18 1974-01-23
# 4 78122 In 1 1988-05-15 1990-02-23
# 5 78122 In 2 1994-05-04 1994-07-27
# 6 78122 In 3 1996-01-26 2009-09-01
# 7 78122 Out 0 1974-01-24 1988-05-14
# 8 78122 Out 1 1990-02-24 1994-05-03
# 9 78122 Out 2 1994-07-22 1996-01-25
# 10 78123 Birth 0 1919-10-12 1919-10-12
# 11 78123 Death 0 2009-09-26 2009-09-26
# 12 78123 In 0 1966-01-18 1974-01-23
# 13 78123 In 1 1988-05-15 1990-02-23
# 14 78123 In 2 1994-05-04 1994-07-27
# 15 78123 In 3 1996-01-26 2009-09-01
# 16 78123 Out 0 1974-01-24 1988-05-14
# 17 78123 Out 1 1990-02-24 1994-05-03
# 18 78123 Out 2 1994-07-22 1996-01-25
Note: d
is the data as given by @jay.sf
Upvotes: 0
Reputation: 72683
One way to do this is to capture the dates as such while forcing them into numeric values with sapply
, to be able to use range
later on. Then, within ave
we use rle
in an mapply
to let a variable x grow by 1 every time Status changes. We now may easily aggregate
the range
s over Id and x, where a columns subset already gives us the result, which we just need to convert as.Date
and cbind
the suffix of x to it with gsub
.
d[2:3] <- sapply(d[2:3], function(x) as.Date(x, "%m/%d/%Y"))
f <- function(x) {r <- rle(x)$l;unlist(mapply(rep, seq(r), r))}
d <- transform(d, x=paste(Id, ave(Status, Id, FUN=f), Status))
r <- do.call(data.frame, aggregate(cbind(Start, Stop) ~ Id + x, d, FUN=range))[c(1:3, 6)]
r[3:4] <- lapply(r[3:4], as.Date, origin="1970-01-01")
r <- cbind(r[1], setNames(r[3:4], c("MinStart", "MaxStop")), Status=gsub(".*\\s", "", r$x))
r[order(r$Id), ]
# Id MinStart MaxStop Status
# 1 78122 1919-10-12 1919-10-12 Birth
# 2 78122 1966-01-18 1974-01-23 In
# 3 78122 1974-01-24 1988-05-14 Out
# 4 78122 1988-05-15 1990-02-23 In
# 5 78122 1990-02-24 1994-05-03 Out
# 6 78122 1994-05-04 1994-07-27 In
# 7 78122 1994-07-22 1996-01-25 Out
# 8 78122 1996-01-26 2009-09-01 In
# 9 78122 2009-09-26 2009-09-26 Death
# 10 78123 1919-10-12 1919-10-12 Birth
# 11 78123 1966-01-18 1974-01-23 In
# 12 78123 1974-01-24 1988-05-14 Out
# 13 78123 1988-05-15 1990-02-23 In
# 14 78123 1990-02-24 1994-05-03 Out
# 15 78123 1994-05-04 1994-07-27 In
# 16 78123 1994-07-22 1996-01-25 Out
# 17 78123 1996-01-26 2009-09-01 In
# 18 78123 2009-09-26 2009-09-26 Death
Data:
Note: Data frame is doubled and Id is incremented by one for demonstration purposes.
d <- structure(list(Id = c(78122L, 78122L, 78122L, 78122L, 78122L,
78122L, 78122L, 78122L, 78122L, 78122L, 78122L, 78122L, 78122L,
78122L, 78122L, 78122L, 78122L, 78122L, 78122L, 78122L, 78123L,
78123L, 78123L, 78123L, 78123L, 78123L, 78123L, 78123L, 78123L,
78123L, 78123L, 78123L, 78123L, 78123L, 78123L, 78123L, 78123L,
78123L, 78123L, 78123L), Start = c("10/12/1919", "1/18/1966",
"2/3/1972", "9/9/1972", "1/24/1974", "10/23/1975", "5/5/1979",
"8/30/1980", "5/15/1988", "6/19/1988", "1/13/1989", "2/24/1990",
"6/16/1991", "2/12/1993", "5/4/1994", "7/22/1994", "1/26/1996",
"11/14/2001", "11/20/2001", "9/26/2009", "10/12/1919", "1/18/1966",
"2/3/1972", "9/9/1972", "1/24/1974", "10/23/1975", "5/5/1979",
"8/30/1980", "5/15/1988", "6/19/1988", "1/13/1989", "2/24/1990",
"6/16/1991", "2/12/1993", "5/4/1994", "7/22/1994", "1/26/1996",
"11/14/2001", "11/20/2001", "9/26/2009"), Stop = c("10/12/1919",
"2/2/1972", "9/8/1972", "1/23/1974", "10/22/1975", "5/4/1979",
"8/29/1980", "5/14/1988", "6/18/1988", "1/12/1989", "2/23/1990",
"6/15/1991", "2/11/1993", "5/3/1994", "7/27/1994", "1/25/1996",
"11/13/2001", "11/19/2001", "9/1/2009", "9/26/2009", "10/12/1919",
"2/2/1972", "9/8/1972", "1/23/1974", "10/22/1975", "5/4/1979",
"8/29/1980", "5/14/1988", "6/18/1988", "1/12/1989", "2/23/1990",
"6/15/1991", "2/11/1993", "5/3/1994", "7/27/1994", "1/25/1996",
"11/13/2001", "11/19/2001", "9/1/2009", "9/26/2009"), Status = c("Birth",
"In", "In", "In", "Out", "Out", "Out", "Out", "In", "In", "In",
"Out", "Out", "Out", "In", "Out", "In", "In", "In", "Death",
"Birth", "In", "In", "In", "Out", "Out", "Out", "Out", "In",
"In", "In", "Out", "Out", "Out", "In", "Out", "In", "In", "In",
"Death")), class = "data.frame", row.names = c(NA, -40L))
Upvotes: 0
Reputation: 24790
You need some run length encoding. I'm going to use data.table::rleid
for convenience, but you can use the base version if you want:
library(data.table)
testcase %>%
group_by(Id, RLE = rleid(Status)) %>%
arrange(Start) %>%
dplyr::summarise(Start = min(Start), Stop = max(Stop), Status = first(Status))
# A tibble: 9 x 5
# Groups: Id [1]
Id RLE Start Stop Status
<int> <int> <date> <date> <chr>
1 78122 1 1919-10-12 1919-10-12 Birth
2 78122 2 1966-01-18 1974-01-23 In
3 78122 3 1974-01-24 1988-05-14 Out
4 78122 4 1988-05-15 1990-02-23 In
5 78122 5 1990-02-24 1994-05-03 Out
6 78122 6 1994-05-04 1994-07-27 In
7 78122 7 1994-07-22 1996-01-25 Out
8 78122 8 1996-01-26 2009-09-01 In
9 78122 9 2009-09-26 2009-09-26 Death
Note that I converted your dates into class date
, which I will leave to you. Otherwise they don't sort correctly.
And here is the group_by
call without data.table
...
group_by(Id, RLE = with(rle(Status), rep(seq_along(lengths), lengths))) %>%
...
Upvotes: 1