Reputation: 1746
I've the following data:
ID GROUP DATE
A GR1 12/01/2013
A GR1 09/04/2014
A GR1 01/03/2015
A GR2 04/04/2015
A GR2 08/21/2015
A GR1 01/05/2016
A GR1 06/28/2016
B GR2 11/01/2013
B GR2 06/04/2014
B GR2 04/15/2015
B GR3 11/04/2015
B GR2 03/21/2016
B GR2 07/05/2016
B GR1 06/28/2016
C GR2 01/16/2014
C GR2 06/04/2014
C GR2 04/15/2015
C GR3 11/04/2015
C GR2 03/21/2016
C GR2 06/05/2016
C GR1 06/28/2016
I want to get the difference the person remained in each group. So the new table will look like the following:
ID GROUP DATE Diff
A GR1 12/01/2013
A GR1 09/04/2014
A GR1 01/03/2015 398
A GR2 04/04/2015
A GR2 08/21/2015 139
A GR1 01/05/2016
A GR1 06/28/2016 175
B GR2 11/01/2013
B GR2 06/04/2014
B GR2 04/15/2015 530
B GR3 11/04/2015
B GR2 03/21/2016
B GR2 07/05/2016 106
B GR1 06/28/2016
C GR2 01/16/2014
C GR2 06/04/2014
C GR2 04/15/2015 454
C GR3 11/04/2015
C GR2 03/21/2016
C GR2 01/05/2016 76
C GR1 06/28/2016
THe value in column "Diff" 398 is coming by taking difference '01/03/2015' - '12/1/2013'. Similarly all other difference.
Now my question is how to get this difference? I can't take max(date)-min(date) at each group, because group are repeating at different period. Similarly I can't take first dot and last dot as in SAS.
I'll be extremely grateful if someone help me with the solution. I would prefer the solution in SAS as the data size is very large. SO will not hold in memory.
Regards,
Upvotes: 4
Views: 167
Reputation: 8413
library(dplyr)
library(data.table)
df$xxx = rleidv(df[, c("ID","GROUP"),with = FALSE ])
df$DATE = as.Date(df$DATE, format = "%m/%d/%Y")
df %>% group_by(xxx) %>% mutate(diff = max(DATE) - min(DATE)) %>%
ungroup(xxx) %>% mutate(xxx = NULL)
# ID GROUP DATE diff
# <chr> <chr> <date> <time>
#1 A GR1 2013-12-01 398 days
#2 A GR1 2014-09-04 398 days
#3 A GR1 2015-01-03 398 days
#4 A GR2 2015-04-04 139 days
#5 A GR2 2015-08-21 139 days
#6 A GR1 2016-01-05 175 days
#7 A GR1 2016-06-28 175 days
#8 B GR2 2013-11-01 530 days
#9 B GR2 2014-06-04 530 days
#10 B GR2 2015-04-15 530 days
using only data.table
:
library(data.table)
df[, diff := max(DATE)-min(DATE),by = c("xxx")][,xxx:=NULL]
Upvotes: 6
Reputation: 51566
Doing it with SAS is trivial. Use RETAIN to keep the start date from the first record for the group. Your data does not appeared sorted so either sort it first or if you want to keep the current order (and the records within the groups are already sorted by date) then you can use the NOTSORTED
option on the BY
statement.
data want ;
set have ;
by id group notsorted;
if first.group then start = date ;
else if last.group then diff = date - start ;
retain start;
drop start;
run;
If you need keep the current order, but the dates are not sorted within the groups, then to discover the min and max dates within the group you will need add another variable and a little more logic.
data want ;
set have ;
by id group notsorted;
if first.group then start = date ;
if first.group then stop = date ;
start = min(start,date);
stop = max(stop,date);
if last.group and not first.group then diff = stop - start ;
retain start stop;
drop start stop;
run;
Upvotes: 5
Reputation: 4554
data want(drop=_:);
merge have have(firstobs=2 rename=(id=_id group=_group date=_date));
retain _temp;
_temp= min(_temp,date);
if id^=_id or group^=_group then do;
diff=intck('day',_temp,date);
if diff=0 then call missing(diff);
_temp=_date;
end;
run;
Upvotes: 2