Reputation: 592
I have two datasets:
data snippet1;
input ID callDate :ddmmyy. start_date :ddmmyy. end_date :ddmmyy. cured ;
format Date start_date end_date ddmmyy10.;
datalines4;
001 30/11/2020 28/11/2020 01/12/2020 Cured
001 01/12/2020 28/11/2020 01/12/2020 Cured
001 30/12/2020 28/12/2020 04/01/2021 Not Cured
001 31/12/2020 28/12/2020 04/01/2021 Not Cured
001 01/02/2021 28/01/2021 01/02/2021 Cured
;;;;
data have1;
input ID event_date :ddmmyy. description ;
format event_date ddmmyy10.;
datalines4;
001 28Oct2020
001 29Nov2020
001 29Nov2020 New Plan
001 30Nov2020
001 01Dec2020
001 01Dec2020 New Plan
001 01Dec2020 Stop Category
001 01Dec2020 Review Date
001 02Dec2020
001 02Dec2020 OLd Contact Strategy Level
001 02Dec2020
001 04Dec2020 Stop Category
001 04Dec2020 Review Date
001 29Dec2020
001 29Dec2020 New Plan
001 30Dec2020
001 31Dec2020
001 01Jan2021
001 01Jan2021
001 02Jan2021
001 04Jan2021
001 05Jan2021 OLd Contact Strategy Level
001 05Jan2021
001 29Jan2021
001 29Jan2021 New Plan
001 30Jan2021
001 31Jan2021
001 01Feb2021
001 01Feb2021
001 02Feb2021
001 02Feb2021 OLd Contact Strategy Level
001 02Feb2021
;;;;
I'm trying to get, basically Snippet1
with an two column called Description1
and Description2
which will grab the first and last description, respectively, of each period between Calldate + Calldate+2
. So for Calldate = 02Dec2020, Description1 = OLd Contact Strategy Level and Description2= Review Date
I have more ID's of course, but I think just with one it is enough to see my problem anyways.
This is the code I have so far:
proc sql;
create table want as
select a.*
, min(c.description) as description1
, max(c.description) as description2
from snippet1 a
inner join
have1 c
on a.id= c.id
and a.calldate<= c.event_date
and c.event_date <= a.calldate+ 2
Group by 1;
Quit;
But this is the outcome:
data snippet1;
input ID callDate :ddmmyy. start_date :ddmmyy. end_date :ddmmyy. cured description1 description2;
format Date start_date end_date ddmmyy10.;
datalines4;
001 01/12/2020 28/11/2020 01/12/2020 Cured New Plan Stop Category
001 30/11/2020 28/11/2020 01/12/2020 Cured New Plan Stop Category
001 30/11/2020 28/11/2020 01/12/2020 Cured New Plan Stop Category
001 30/11/2020 28/11/2020 01/12/2020 Cured New Plan Stop Category
001 01/12/2020 28/11/2020 01/12/2020 Cured New Plan Stop Category
001 01/12/2020 28/11/2020 01/12/2020 Cured New Plan Stop Category
001 30/11/2020 28/11/2020 01/12/2020 Cured New Plan Stop Category
001 30/11/2020 28/11/2020 01/12/2020 Cured New Plan Stop Category
001 30/12/2020 28/12/2020 04/01/2021 Not Cured New Plan Stop Category
001 01/12/2020 28/11/2020 01/12/2020 Cured New Plan Stop Category
001 30/11/2020 28/11/2020 01/12/2020 Cured New Plan Stop Category
001 30/11/2020 28/11/2020 01/12/2020 Cured New Plan Stop Category
001 01/12/2020 28/11/2020 01/12/2020 Cured New Plan Stop Category
001 30/12/2020 28/12/2020 04/01/2021 Not Cured New Plan Stop Category
001 01/12/2020 28/11/2020 01/12/2020 Cured New Plan Stop Category
001 01/12/2020 28/11/2020 01/12/2020 Cured New Plan Stop Category
001 01/12/2020 28/11/2020 01/12/2020 Cured New Plan Stop Category
001 30/11/2020 28/11/2020 01/12/2020 Cured New Plan Stop Category
001 31/12/2020 28/12/2020 04/01/2021 Not Cured New Plan Stop Category
001 31/12/2020 28/12/2020 04/01/2021 Not Cured New Plan Stop Category
001 01/02/2021 28/01/2021 01/02/2021 Cured New Plan Stop Category
001 01/02/2021 28/01/2021 01/02/2021 Cured New Plan Stop Category
001 01/02/2021 28/01/2021 01/02/2021 Cured New Plan Stop Category
001 01/12/2020 28/11/2020 01/12/2020 Cured New Plan Stop Category
001 31/12/2020 28/12/2020 04/01/2021 Not Cured New Plan Stop Category
001 31/12/2020 28/12/2020 04/01/2021 Not Cured New Plan Stop Category
001 30/12/2020 28/12/2020 04/01/2021 Not Cured New Plan Stop Category
001 30/12/2020 28/12/2020 04/01/2021 Not Cured New Plan Stop Category
001 01/02/2021 28/01/2021 01/02/2021 Cured New Plan Stop Category
001 01/02/2021 28/01/2021 01/02/2021 Cured New Plan Stop Category
001 30/12/2020 28/12/2020 04/01/2021 Not Cured New Plan Stop Category
;;;;
As you can see, the dates repeat themselves a couple of times, and I'm not even sure all call dates are even properly in there.
Does anyone have any idea?
Upvotes: 0
Views: 462
Reputation: 1394
This is so called table lookup
problem. I recommend you the double set
skill. It is very easy to learn.
Incidentally, I have fixed several mistakes in your data input step.
data snippet1;
input ID$ callDate :ddmmyy. start_date :ddmmyy. end_date :ddmmyy. cured$13. ;
format callDate start_date end_date ddmmyy10.;
datalines4;
001 30/11/2020 28/11/2020 01/12/2020 Cured
001 01/12/2020 28/11/2020 01/12/2020 Cured
001 30/12/2020 28/12/2020 04/01/2021 Not Cured
001 31/12/2020 28/12/2020 04/01/2021 Not Cured
001 01/02/2021 28/01/2021 01/02/2021 Cured
;;;;
run;
data have1;
input ID$ event_date :date9. description $42. ;
format event_date ddmmyy10.;
datalines4;
001 28Oct2020
001 29Nov2020
001 29Nov2020 New Plan
001 30Nov2020
001 01Dec2020
001 01Dec2020 New Plan
001 01Dec2020 Stop Category
001 01Dec2020 Review Date
001 02Dec2020
001 02Dec2020 OLd Contact Strategy Level
001 02Dec2020
001 04Dec2020 Stop Category
001 04Dec2020 Review Date
001 29Dec2020
001 29Dec2020 New Plan
001 30Dec2020
001 31Dec2020
001 01Jan2021
001 01Jan2021
001 02Jan2021
001 04Jan2021
001 05Jan2021 OLd Contact Strategy Level
001 05Jan2021
001 29Jan2021
001 29Jan2021 New Plan
001 30Jan2021
001 31Jan2021
001 01Feb2021
001 01Feb2021
001 02Feb2021
001 02Feb2021 OLd Contact Strategy Level
001 02Feb2021
;;;;
run;
data want1;
length description1 description2 $42.;
set snippet1;
do i = 1 to rec;
set have1(rename=ID=TmpID)nobs=rec point=i;
if ID=TmpID and callDate <= event_date <= callDate + 2 then do;
if description1 = '' then description1 = description;
if description ^= '' then description2 = description;
end;
end;
drop Tmp:;
run;
There is an excellent article about double set
that you may want to read:
Multiple Set Statements in a Data Step: A Powerful Technique for Combining and Aggregating Complex Data
Upvotes: 2
Reputation: 51566
Do it in two steps. First find the descriptions that fall into the date ranges and sort them by event_date.
proc sql ;
create table list as
select a.id,a.calldate,a.start_date,a.end_date,a.cured
, b.event_date,b.description
from snippet1 a left join have1 b
on a.id=b.id
and a.calldate<= b.event_date
and b.event_date <= a.calldate+ 2
and b.description is not null
order by a.id,a.calldate,a.start_date,a.end_date,a.cured,b.event_date
;
quit;
Then process the list to reduce to first and last.
data want;
set list;
by id calldate start_date end_date cured ;
length description1 description2 $42 ;
if first.cured then do;
event_date1=event_date;
description1=description;
end;
retain event_date1 description1;
if last.cured then do;
if not first.cured then do;
description2=description;
event_date2=event_date;
end;
output;
end;
drop description event_date;
format event_date1 event_date2 yymmdd10.;
run;
Results:
Upvotes: 1
Reputation: 63424
I've taken @whymath's solution and worked it to be a bit faster; this is probably only necessary if you have GBs of data or need very high performance, though.
First, we construct a dataset that stores the first row number for each event_date in have1
. We'll use that in the second data step to guide our hand when we retrieve rows from have1
so we do not need to iterate needlessly. We also create an index here to enable keyed set in the next step.
Second, we use that with the key
option to retrieve that starting row, and then use that in the point
loop instead of 1
. We also add leave
here to allow us to stop iterating when we're past the mark.
These all assume the dataset is in the order you want - but I think we have to assume that or your whole idea goes awry. Do make sure though that it's properly in order, or you will have issues.
data have1_ids(index=(id_calldate=(id calldate)));
set have1;
rename event_Date=calldate;
by id event_date;
_row+1; *This keeps track of the row number only;
if first.event_date;
keep id event_date _row;
run;
data want1;
length description1 description2 $42.;
set snippet1;
set have1_ids key=id_calldate;
do i = _row to rec; *now we can start on the right row;
set have1(rename=ID=_ID) nobs=rec point=i;
if (event_date gt calldate+2) or (ID ne _ID) then leave; *conditions to exit the loop - if either of these is true then we are done here;
if missing(description1) then description1 = description; *populate the earlier description once we hit a valid description;
if not missing(description) then description2 = description; *keep rewriting this until the end;
end;
drop _:;
run;
Note I do not check for if _IORC_ eq 0
here like I do in the other keyed set answer - that's because I don't care much if it fails; if it doesn't find a matching row, then the prior value for row
is okay to use. It's not optimal, but it's close - and there's no good way to get the next row.
Upvotes: 1
Reputation: 63424
Here's one more way to do it, which might be the most performant in some cases where the have1
is relatively large. It's a bit less flexible though in some ways.
This uses the keyed set to do all of the work. It takes have1
, and makes three copies of each row - one for each date you want it to qualify for. Then, keyed set simply grabs the rows that are on the right date. Keyed set uses an index on the set
dataset to find matching rows by the index.
data have1_expanded(index=(id_calldate=(id calldate)));
set have1;
if not missing(description);
format calldate date9.;
do calldate=event_date to event_date-2 by -1;
output;
end;
run;
data want1;
set snippet1;
do _n_ = 1 by 1 until (_IORC_ ne 0); *technically pointless but I always include it to make sure I do not forget _IORC_;
set have1_expanded key=id_calldate end=eof;
if _IORC_ ne 0 then leave; *as keyed set iterates, _IORC_ will be zero when it finds a match and nonzero when it does not find any more matches;
if _n_ eq 1 then description1=description; *first time through, grab that first description;
description2=description; *every time through, overwrite this to get the last description;
end;
run;
Upvotes: 1
Reputation: 63424
You can't get what you want with pure SQL, or at least not with some modifications to your data. SQL doesn't respect "order", and particularly not SAS's implementation of SQL (which doesn't allow for mid-query ordering). So your request to get the "last" row doesn't fly: all rows with calldate+2 are equivalent as far as SQL is concerned, and you could get any of them effectively at random if you successfully asked for that. (It wouldn't actually be random, but you should treat it as if it were for the purposes of writing code - only do that if you really don't care which you get.)
To do this in SQL, you must add in an ordering field. That's technically possible "in line" but not recommended (using monotonic()
) as it is an undocumented function. Better is to add it in with a data step view.
First:
data have1_v/view=have1_v;
set have1;
by id;
if first.id then id_row=0;
id_row+1;
run;
This establishes the order. Then:
proc sql;
select snippet1.*,
( select description from have1_v where have1_v.id=snippet1.id
and have1_v.event_Date between snippet1.calldate and snippet1.calldate+2
and have1_v.description is not null
having have1_v.id_row = min(have1_v.id_row)
)as min_descript,
( select description from have1_v where have1_v.id=snippet1.id
and have1_v.event_Date between snippet1.calldate and snippet1.calldate+2
and have1_v.description is not null
having have1_v.id_row = max(have1_v.id_row)
)as max_descript
from snippet1;
quit;
This grabs your 'min' and 'max' description. I think this returns what you ask for, though there is no row in snippet1
that matches the 12/2 date you mention in the question.
All this said, SAS has better tools for things like this where order matters. The SAS data step does have a solution for example: it guarantees order, assuming you don't muck with it yourself. See for example how you could create a summarized dataset:
data have1_summarized;
set have1;
by id event_date;
where not missing(description);
retain min_description max_description;
if first.event_date then min_description = description;
max_description=description;
if last.event_date then output;
run;
Now you can use SQL or other tools to combine this back with the snippet1
dataset, as you no longer have duplicate event dates, so ordering no longer matters.
Upvotes: 2