Reputation: 816
I have a data set with many variables. Those of interest are: ID, Episode, Start, End, Assessment Date. An example data set is shown
ID Episode Start End AssessmentDate
1 1 1/1/2012 12/21/2012 1/1/2012
1 1 1/1/2010 12/21/2012 12/12/2012
1 1 1/1/2010 12/21/2012 12/21/2012
1 2 1/1/2013 . 1/2/2013
1 2 1/1/2013 . 2/2/2013
1 2 1/1/2013 . 3/2/2013
2 1 1/1/2012 . 4/1/2012
2 1 1/1/2010 . 5/12/2012
2 1 1/1/2010 . 6/21/2012
2 2 1/1/2013 . 7/2/2013
2 2 1/1/2013 . 8/2/2013
2 2 1/1/2013 . 9/2/2013
I have start dates for everyone, but not any end dates. I want to identify the end date for each episode and for each patient, for 10,000 patients. I want the end date to be the last date of assessment per episode number, and I want this to be present for each row between the first and last assessment dates.
I was reading a bit about splitting the data set into many smaller parts based on the ID and Episode, but I feel like there should be a simpler way to do this. I'm new to R, coming from SAS, and an issue like this in SAS would not give me too much trouble.
I would appreciate any input you may have regarding my data preparations.
Upvotes: 0
Views: 84
Reputation: 887911
Or using data.table
(data from @MrFlicks post)
library(data.table)
setDT(dd)[, NewEnd:=max(AssessmentDate), by=Episode]
dd
# ID Episode Start End AssessmentDate NewEnd
#1: 1 1 2012-01-01 2012-12-21 2012-01-01 2012-12-21
#2: 1 1 2010-01-01 2012-12-21 2012-12-12 2012-12-21
#3: 1 1 2010-01-01 2012-12-21 2012-12-21 2012-12-21
#4: 1 2 2013-01-01 <NA> 2013-01-02 2013-03-02
#5: 1 2 2013-01-01 <NA> 2013-02-02 2013-03-02
#6: 1 2 2013-01-01 <NA> 2013-03-02 2013-03-02
Or dplyr
library(dplyr)
dd %>%
group_by(Episode) %>%
mutate(NewEnd=max(AssessmentDate))
Upvotes: 1
Reputation: 206576
Assuming you've properly read in your values as Dates and treated the "." like a NA
value in R, your sample data frame should have this structure
dd<-structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L), Episode = c(1L,
1L, 1L, 2L, 2L, 2L), Start = structure(c(15340, 14610, 14610,
15706, 15706, 15706), class = "Date"), End = structure(c(15695,
15695, 15695, NA, NA, NA), class = "Date"), AssessmentDate = structure(c(15340,
15686, 15695, 15707, 15738, 15766), class = "Date")), .Names = c("ID",
"Episode", "Start", "End", "AssessmentDate"), row.names = c(NA,
-6L), class = "data.frame")
You can then calculate the max assessment date per episode using the base ave()
function with
dd$NewEnd <- ave(dd$AssessmentDate, dd$Episode, FUN=max)
which gives
ID Episode Start End AssessmentDate NewEnd
1 1 1 2012-01-01 2012-12-21 2012-01-01 2012-12-21
2 1 1 2010-01-01 2012-12-21 2012-12-12 2012-12-21
3 1 1 2010-01-01 2012-12-21 2012-12-21 2012-12-21
4 1 2 2013-01-01 <NA> 2013-01-02 2013-03-02
5 1 2 2013-01-01 <NA> 2013-02-02 2013-03-02
6 1 2 2013-01-01 <NA> 2013-03-02 2013-03-02
Here I didn't overwrite the existing End values. I wasn't sure what wanted to do in cases where it didn't match.
Upvotes: 1
Reputation: 1437
You can find the maximum assessment date by episode using ddply()
from the plyr
library:
df <- data.frame(id=1, Episode=c(1,1,1,2,2,2), AssessmentDate=as.Date(c("2012-01-01", "2012-12-12", "2012-12-21", "2013-01-02", "2013-02-02", "2013-03-02")))
library(plyr)
df <- ddply(df, .(Episode), transform, End=max(AssessmentDate))
df
Which gives you:
ID Episode AssessmentDate End
1 1 1 2012-01-01 2012-12-21
2 1 1 2012-12-12 2012-12-21
3 1 1 2012-12-21 2012-12-21
4 1 2 2013-01-02 2013-03-02
5 1 2 2013-02-02 2013-03-02
6 1 2 2013-03-02 2013-03-02
If you want to do it by patient, you can use ddply()
with .(ID)
(assuming that identifies patients) or something like that.
It's also possible to do this with by()
, but becomes a bit more complicated because it will split the data into lists identified by the values of the grouping variable.
Edit: also, if Episode
is not unique over the entire data frame, i.e. it repeats for each patient, you could group by both variables, i.e. ddply(df, .(ID, Episode), ...)
.
Upvotes: 1