Reputation: 816

Creating Repeated Start and End Dates

I have a data set with many variables. Those of interest are: ID, Episode, Start, End, Assessment Date. An example data set is shown

 ID Episode     Start         End  AssessmentDate
 1       1  1/1/2012  12/21/2012        1/1/2012
 1       1  1/1/2010  12/21/2012      12/12/2012
 1       1  1/1/2010  12/21/2012      12/21/2012
 1       2  1/1/2013           .        1/2/2013
 1       2  1/1/2013           .        2/2/2013
 1       2  1/1/2013           .        3/2/2013
 2       1  1/1/2012           .        4/1/2012
 2       1  1/1/2010           .       5/12/2012
 2       1  1/1/2010           .       6/21/2012
 2       2  1/1/2013           .        7/2/2013
 2       2  1/1/2013           .        8/2/2013
 2       2  1/1/2013           .        9/2/2013

I have start dates for everyone, but not any end dates. I want to identify the end date for each episode and for each patient, for 10,000 patients. I want the end date to be the last date of assessment per episode number, and I want this to be present for each row between the first and last assessment dates.

I was reading a bit about splitting the data set into many smaller parts based on the ID and Episode, but I feel like there should be a simpler way to do this. I'm new to R, coming from SAS, and an issue like this in SAS would not give me too much trouble.

I would appreciate any input you may have regarding my data preparations.

Upvotes: 0

Answers (3)

akrun

Reputation: 887911

Or using data.table (data from @MrFlicks post)

 library(data.table)
 setDT(dd)[, NewEnd:=max(AssessmentDate), by=Episode]
 dd
 #      ID Episode      Start        End AssessmentDate     NewEnd
 #1:  1       1 2012-01-01 2012-12-21     2012-01-01 2012-12-21
 #2:  1       1 2010-01-01 2012-12-21     2012-12-12 2012-12-21
 #3:  1       1 2010-01-01 2012-12-21     2012-12-21 2012-12-21
 #4:  1       2 2013-01-01       <NA>     2013-01-02 2013-03-02
 #5:  1       2 2013-01-01       <NA>     2013-02-02 2013-03-02
 #6:  1       2 2013-01-01       <NA>     2013-03-02 2013-03-02

Or dplyr

library(dplyr)
 dd %>% 
    group_by(Episode) %>% 
    mutate(NewEnd=max(AssessmentDate))

Upvotes: 1

MrFlick

Reputation: 206576

Assuming you've properly read in your values as Dates and treated the "." like a NA value in R, your sample data frame should have this structure

dd<-structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L), Episode = c(1L, 
1L, 1L, 2L, 2L, 2L), Start = structure(c(15340, 14610, 14610, 
15706, 15706, 15706), class = "Date"), End = structure(c(15695, 
15695, 15695, NA, NA, NA), class = "Date"), AssessmentDate = structure(c(15340, 
15686, 15695, 15707, 15738, 15766), class = "Date")), .Names = c("ID", 
"Episode", "Start", "End", "AssessmentDate"), row.names = c(NA, 
-6L), class = "data.frame")

You can then calculate the max assessment date per episode using the base ave() function with

dd$NewEnd <- ave(dd$AssessmentDate, dd$Episode, FUN=max)

which gives

  ID Episode      Start        End AssessmentDate     NewEnd
1  1       1 2012-01-01 2012-12-21     2012-01-01 2012-12-21
2  1       1 2010-01-01 2012-12-21     2012-12-12 2012-12-21
3  1       1 2010-01-01 2012-12-21     2012-12-21 2012-12-21
4  1       2 2013-01-01       <NA>     2013-01-02 2013-03-02
5  1       2 2013-01-01       <NA>     2013-02-02 2013-03-02
6  1       2 2013-01-01       <NA>     2013-03-02 2013-03-02

Here I didn't overwrite the existing End values. I wasn't sure what wanted to do in cases where it didn't match.

Upvotes: 1

andybega

Reputation: 1437

You can find the maximum assessment date by episode using ddply() from the plyr library:

df <- data.frame(id=1, Episode=c(1,1,1,2,2,2), AssessmentDate=as.Date(c("2012-01-01", "2012-12-12", "2012-12-21", "2013-01-02", "2013-02-02", "2013-03-02")))

library(plyr)

df <- ddply(df, .(Episode), transform, End=max(AssessmentDate))
df

Which gives you:

  ID Episode AssessmentDate        End
1  1       1     2012-01-01 2012-12-21
2  1       1     2012-12-12 2012-12-21
3  1       1     2012-12-21 2012-12-21
4  1       2     2013-01-02 2013-03-02
5  1       2     2013-02-02 2013-03-02
6  1       2     2013-03-02 2013-03-02

If you want to do it by patient, you can use ddply() with .(ID) (assuming that identifies patients) or something like that.

It's also possible to do this with by(), but becomes a bit more complicated because it will split the data into lists identified by the values of the grouping variable.

Edit: also, if Episode is not unique over the entire data frame, i.e. it repeats for each patient, you could group by both variables, i.e. ddply(df, .(ID, Episode), ...).

Upvotes: 1

Creating Repeated Start and End Dates

Answers (3)

Related Questions