Reputation: 15
I have a wide dataset with 6 dates and 6 types. The dataset has between 2 - 4 million rows so I need the most efficient way to do this. Each type number corresponds to the date number. I need to compare each date and type to each other (2 at a time) and find the earliest date that these requirements are met:
Criteria
If both types = P, the difference between the dates must be 17 or more.
All other combinations must be 24 or more days apart
I have to find the earliest date match. So in my example, my "have" dataset has all 6 entries. My want dataset has the earliest two dates that meet my criteria which are 2 & 4. I can calculate this manually but can't figure out how to do it in SAS. I would love to have this in some sort of iterative macro program instead of taking up hundreds of lines of code.
My logic on paper
My calculations for Day Differences Between:
1 & 2: 1 day (ignore, less than 17)
1 & 3: 7 days (ignore, less than 17)
1 & 4: 18 days (ignore, not both P)
1 & 5: 20 days (ignore, not both P)
1 & 6: 34 days (consider, ge 24)
2 & 3: 6 days (ignore, less than 17)
2 & 4: 17 days (consider, ge 17, both P)
2 & 5: 19 days (consider, ge 17, both P)
2 & 6: 33 days (consider, ge 24)
and so forth
Among all of the sets that met the criteria: 1 and 6, 2 and 4, 2 and 5, 2 and 6. 4 is the earliest so that's what I want.
data have;
input person $
admin_date1 : ?? mmddyy10.
admin_date2 : ??mmddyy10.
admin_date3 : ?? mmddyy10.
admin_date4 : ?? mmddyy10.
admin_date5 : ?? mmddyy10.
admin_date6 : ?? mmddyy10.
type1 $
type2 $
type3 $
type4 $
type5 $
type6 $;
format admin_date1 mmddyy10.
admin_date2 mmddyy10.
admin_date3 mmddyy10.
admin_date4 mmddyy10.
admin_date5 mmddyy10.
admin_date6 mmddyy10.;
datalines;
JohnDoe 01/12/2021 01/13/2021 01/19/2021 01/30/2021 02/01/2021 02/15/2021 M P M P P J
;
run;
data want;
input person $
admin_date1 : ?? mmddyy10.
admin_date2 : ??mmddyy10.
type1 $
type2 $
;
format admin_date1 mmddyy10.
admin_date2 mmddyy10.
;
datalines;
JohnDoe 01/13/2021 01/30/2021 P P
;
run;
Upvotes: 1
Views: 145
Reputation: 63434
This is pretty easy if you transpose to long. Basically just do a second nested set
with a point
that points to the next row and iterates over the other records in the same person. It's not super fast, there are faster ways, but it's quite simple unless you have millions of rows.
data have_long( rename=(_admin_date = admin_date _type = type));
set have;
array admin_date[6];
array type[6];
do _i = 1 to dim(admin_date);
_admin_date = admin_date[_i];
_type = type[_i];
output;
end;
keep person _admin_date _type;
format _admin_date mmddyy10.;
run;
data want;
set have_long nobs=nobs;
if type = 'P';
do _i = _n_ + 1 to nobs until (person ne _person);
set have_long(rename=(person=_person type=_type admin_date=admin_date2)) point=_i;
if person eq _person and admin_date2 ge (admin_date + 17) and _type = 'P' then do;
output;
leave;
end;
end;
drop _:;
run;
Another similar solution, which works so long as have_long
is able to be loaded into memory, is to use a hash table. This might be faster, or might not be, depending on various details of the data.
data want;
if _n_ eq 1 then do; *declare the hash table;
declare hash hl(dataset:'have_long', ordered:'a', multidata:'y'); *ordered ascending and multiple records per key is allowed;
hl.defineKey('person','type');
hl.defineData('admin_date');
hl.defineDone();
end;
format admin_date1 mmddyy8.;
set have_long;
if type = 'P';
admin_date1 = admin_date; *set the original admin_date aside;
rc = hl.find(); *find the first match;
do while (rc eq 0 and admin_date1 gt (admin_date - 17)); *loop unless we either run out of matches or find a valid date;
rc = hl.find_next();
end;
if rc eq 0 then output; *if we exited due to a valid date then output;
run;
Upvotes: 1