Reputation: 11
I have a dataframe with patientID and dates, sorted on date order within ID. Each patient usually has several lines though it’s possible to only have one. For example:
patid date
1302 2009-01-27
1302 2009-02-05
1302 2009-08-28
1670 2009-03-12
2073 2009-04-03
2073 2010-11-01
2073 2010-12-19
2073 2011-03-06
From this, I want to produce a dataframe or CSV file with start and end dates for each patient, so from the above, I’d have
patid start end
1302 2009-01-27 2009-08-28
1670 2009-03-12 2009-03-12
2073 2009-04-03 2011-03-06
I have over 30 million rows in the initial file so I'd prefer not to write a for
loop.
I wonder if there is an efficient way to do this, perhaps to start with using aggregate
to derive line counts for each patient?
Upvotes: 0
Views: 46
Reputation: 1369
aggregate()
with FUN = a simple custom function to return a vector of two outputs min()
and max()
in a single step:As you suggested, you can use aggregate()
- but as shown below you can do it in a single step to calculate both min()
and max()
for each patid
group
# Read in your sample data, being careful to prevent dates from becoming factors
pdates <-
read.table( text="patid date
1302 2009-01-27
1302 2009-02-05
1302 2009-08-28
1670 2009-03-12
2073 2009-04-03
2073 2010-11-01
2073 2010-12-19
2073 2011-03-06",
header=TRUE,
stringsAsFactors=FALSE) # keep date strings from becoming factors!
aggregate( x = pdates["date"], # dataframe with column(s) to aggregate
by = pdates["patid"], # passing dataframe with named column "patid" preserves the column name in the output
FUN = function(vdate) {
c(start=min(vdate), end=max(vdate))
}
)
patid date.start date.end
1 1302 2009-01-27 2009-08-28
2 1670 2009-03-12 2009-03-12
3 2073 2009-04-03 2011-03-06
range()
function:aggregate( pdates["date"], by=pdates["patid"], range)
patid date.1 date.2
1 1302 2009-01-27 2009-08-28
2 1670 2009-03-12 2009-03-12
3 2073 2009-04-03 2011-03-06
Upvotes: 0
Reputation: 3221
Using sqldf
:
input data:
df=read.table(text="patid date
1302 2009-01-27
1302 2009-02-05
1302 2009-08-28
1670 2009-03-12
2073 2009-04-03
2073 2010-11-01
2073 2010-12-19
2073 2011-03-06",header=T)
Code
library(sqldf)
sqldf("select patid,min(date) as start, max(date) as end from df group by patid")
Output:
patid start end
1 1302 2009-01-27 2009-08-28
2 1670 2009-03-12 2009-03-12
3 2073 2009-04-03 2011-03-06
Upvotes: 1
Reputation: 2707
With tidyverse
:
read.table(text="patid date
1302 2009-01-27
1302 2009-02-05
1302 2009-08-28
1670 2009-03-12
2073 2009-04-03
2073 2010-11-01
2073 2010-12-19
2073 2011-03-06",header=T)%>%
group_by(patid)%>%
mutate(date=lubridate::ymd(date))%>%
summarise(start=min(date),
end=max(date))
# A tibble: 3 x 3
patid start end
<int> <date> <date>
1 1302 2009-01-27 2009-08-28
2 1670 2009-03-12 2009-03-12
3 2073 2009-04-03 2011-03-06
Upvotes: 1