Beta
Beta

Reputation: 1746

Finding the Date Difference

I've the following data:

ID GROUP     DATE 
A   GR1   12/01/2013
A   GR1   09/04/2014
A   GR1   01/03/2015
A   GR2   04/04/2015
A   GR2   08/21/2015
A   GR1   01/05/2016
A   GR1   06/28/2016
B   GR2   11/01/2013
B   GR2   06/04/2014
B   GR2   04/15/2015
B   GR3   11/04/2015
B   GR2   03/21/2016
B   GR2   07/05/2016
B   GR1   06/28/2016
C   GR2   01/16/2014
C   GR2   06/04/2014
C   GR2   04/15/2015
C   GR3   11/04/2015
C   GR2   03/21/2016
C   GR2   06/05/2016
C   GR1   06/28/2016

I want to get the difference the person remained in each group. So the new table will look like the following:

ID GROUP     DATE      Diff
A   GR1   12/01/2013
A   GR1   09/04/2014
A   GR1   01/03/2015    398
A   GR2   04/04/2015
A   GR2   08/21/2015    139
A   GR1   01/05/2016
A   GR1   06/28/2016    175
B   GR2   11/01/2013
B   GR2   06/04/2014
B   GR2   04/15/2015    530
B   GR3   11/04/2015
B   GR2   03/21/2016
B   GR2   07/05/2016    106
B   GR1   06/28/2016
C   GR2   01/16/2014
C   GR2   06/04/2014    
C   GR2   04/15/2015    454
C   GR3   11/04/2015
C   GR2   03/21/2016
C   GR2   01/05/2016    76
C   GR1   06/28/2016

THe value in column "Diff" 398 is coming by taking difference '01/03/2015' - '12/1/2013'. Similarly all other difference.

Now my question is how to get this difference? I can't take max(date)-min(date) at each group, because group are repeating at different period. Similarly I can't take first dot and last dot as in SAS.

I'll be extremely grateful if someone help me with the solution. I would prefer the solution in SAS as the data size is very large. SO will not hold in memory.

Regards,

Upvotes: 4

Views: 167

Answers (3)

joel.wilson
joel.wilson

Reputation: 8413

library(dplyr)
library(data.table)
df$xxx = rleidv(df[, c("ID","GROUP"),with = FALSE ])
df$DATE = as.Date(df$DATE, format = "%m/%d/%Y")
df %>% group_by(xxx) %>% mutate(diff = max(DATE) - min(DATE)) %>%
       ungroup(xxx) %>% mutate(xxx = NULL)
#     ID GROUP       DATE     diff
#   <chr> <chr>     <date>   <time>
#1      A   GR1 2013-12-01 398 days
#2      A   GR1 2014-09-04 398 days
#3      A   GR1 2015-01-03 398 days
#4      A   GR2 2015-04-04 139 days
#5      A   GR2 2015-08-21 139 days
#6      A   GR1 2016-01-05 175 days
#7      A   GR1 2016-06-28 175 days
#8      B   GR2 2013-11-01 530 days
#9      B   GR2 2014-06-04 530 days
#10     B   GR2 2015-04-15 530 days

using only data.table:

library(data.table)
df[, diff := max(DATE)-min(DATE),by = c("xxx")][,xxx:=NULL]

Upvotes: 6

Tom
Tom

Reputation: 51566

Doing it with SAS is trivial. Use RETAIN to keep the start date from the first record for the group. Your data does not appeared sorted so either sort it first or if you want to keep the current order (and the records within the groups are already sorted by date) then you can use the NOTSORTED option on the BY statement.

data want ;
  set have ;
  by id group notsorted;
  if first.group then start = date ;
  else if last.group then diff = date - start ;
  retain start;
  drop start;
run;

If you need keep the current order, but the dates are not sorted within the groups, then to discover the min and max dates within the group you will need add another variable and a little more logic.

data want ;
  set have ;
  by id group notsorted;
  if first.group then start = date ;
  if first.group then stop = date ;
  start = min(start,date);
  stop = max(stop,date);
  if last.group and not first.group then diff = stop - start ;
  retain start stop;
  drop start stop;
run;

Upvotes: 5

Shenglin Chen
Shenglin Chen

Reputation: 4554

data want(drop=_:);
   merge have have(firstobs=2 rename=(id=_id group=_group date=_date));
   retain _temp;
   _temp= min(_temp,date);
   if id^=_id or group^=_group then do;
       diff=intck('day',_temp,date);
       if diff=0 then call missing(diff);
       _temp=_date;
   end;
run;

Upvotes: 2

Related Questions