user7415409
user7415409

Reputation: 11

Summarising date ranges

I have a dataframe with patientID and dates, sorted on date order within ID. Each patient usually has several lines though it’s possible to only have one. For example:

patid   date
1302    2009-01-27
1302    2009-02-05
1302    2009-08-28
1670    2009-03-12
2073    2009-04-03
2073    2010-11-01
2073    2010-12-19
2073    2011-03-06

From this, I want to produce a dataframe or CSV file with start and end dates for each patient, so from the above, I’d have

patid   start       end
1302    2009-01-27  2009-08-28
1670    2009-03-12  2009-03-12
2073    2009-04-03  2011-03-06

I have over 30 million rows in the initial file so I'd prefer not to write a for loop.

I wonder if there is an efficient way to do this, perhaps to start with using aggregate to derive line counts for each patient?

Upvotes: 0

Views: 46

Answers (3)

krads
krads

Reputation: 1369

Using base R function aggregate() with FUN = a simple custom function to return a vector of two outputs min() and max() in a single step:

As you suggested, you can use aggregate() - but as shown below you can do it in a single step to calculate both min() and max() for each patid group

# Read in your sample data, being careful to prevent dates from becoming factors
pdates <- 
  read.table( text="patid   date
                    1302    2009-01-27
                    1302    2009-02-05
                    1302    2009-08-28
                    1670    2009-03-12
                    2073    2009-04-03
                    2073    2010-11-01
                    2073    2010-12-19
                    2073    2011-03-06",
                    header=TRUE, 
                    stringsAsFactors=FALSE) # keep date strings from becoming factors!

aggregate( x = pdates["date"],   # dataframe with column(s) to aggregate
           by = pdates["patid"], # passing dataframe with named column "patid" preserves the column name in the output
           FUN = function(vdate) { 
                   c(start=min(vdate), end=max(vdate))
                 }  
         )

  patid date.start   date.end
1  1302 2009-01-27 2009-08-28
2  1670 2009-03-12 2009-03-12
3  2073 2009-04-03 2011-03-06

EDIT: Or, even more simply using the very useful base R range() function:

aggregate( pdates["date"], by=pdates["patid"], range)

  patid     date.1     date.2
1  1302 2009-01-27 2009-08-28
2  1670 2009-03-12 2009-03-12
3  2073 2009-04-03 2011-03-06

Upvotes: 0

Saurabh Chauhan
Saurabh Chauhan

Reputation: 3221

Using sqldf:

input data:

df=read.table(text="patid   date
          1302    2009-01-27
          1302    2009-02-05
          1302    2009-08-28
          1670    2009-03-12
          2073    2009-04-03
          2073    2010-11-01
          2073    2010-12-19
          2073    2011-03-06",header=T)

Code

 library(sqldf)
 sqldf("select patid,min(date) as start, max(date) as end from df group by patid")

Output:

   patid      start        end
1  1302 2009-01-27 2009-08-28
2  1670 2009-03-12 2009-03-12
3  2073 2009-04-03 2011-03-06

Upvotes: 1

jyjek
jyjek

Reputation: 2707

With tidyverse:

read.table(text="patid   date
           1302    2009-01-27
           1302    2009-02-05
           1302    2009-08-28
           1670    2009-03-12
           2073    2009-04-03
           2073    2010-11-01
           2073    2010-12-19
           2073    2011-03-06",header=T)%>%
   group_by(patid)%>%
   mutate(date=lubridate::ymd(date))%>%
   summarise(start=min(date),
             end=max(date))
# A tibble: 3 x 3
  patid start      end       
  <int> <date>     <date>    
1  1302 2009-01-27 2009-08-28
2  1670 2009-03-12 2009-03-12
3  2073 2009-04-03 2011-03-06

Upvotes: 1

Related Questions