Reputation: 592

proc SQL - join duplicating values

I have two datasets:

data snippet1;
  input ID callDate :ddmmyy. start_date :ddmmyy. end_date :ddmmyy. cured ;
  format Date start_date end_date  ddmmyy10.;
datalines4;
001 30/11/2020  28/11/2020  01/12/2020  Cured
001 01/12/2020  28/11/2020  01/12/2020  Cured
001 30/12/2020  28/12/2020  04/01/2021  Not Cured
001 31/12/2020  28/12/2020  04/01/2021  Not Cured
001 01/02/2021  28/01/2021  01/02/2021  Cured
    ;;;;

data have1;
  input ID  event_date :ddmmyy. description ;
  format  event_date ddmmyy10.;
datalines4;
001 28Oct2020   
001 29Nov2020   
001 29Nov2020   New Plan
001 30Nov2020   
001 01Dec2020   
001 01Dec2020   New Plan
001 01Dec2020   Stop Category
001 01Dec2020   Review Date
001 02Dec2020   
001 02Dec2020   OLd Contact Strategy Level
001 02Dec2020   
001 04Dec2020   Stop Category
001 04Dec2020   Review Date
001 29Dec2020   
001 29Dec2020   New Plan
001 30Dec2020   
001 31Dec2020   
001 01Jan2021   
001 01Jan2021   
001 02Jan2021   
001 04Jan2021   
001 05Jan2021   OLd Contact Strategy Level
001 05Jan2021   
001 29Jan2021   
001 29Jan2021   New Plan
001 30Jan2021   
001 31Jan2021   
001 01Feb2021   
001 01Feb2021   
001 02Feb2021   
001 02Feb2021   OLd Contact Strategy Level
001 02Feb2021
    ;;;;

I'm trying to get, basically Snippet1 with an two column called Description1 and Description2 which will grab the first and last description, respectively, of each period between Calldate + Calldate+2. So for Calldate = 02Dec2020, Description1 = OLd Contact Strategy Level and Description2= Review Date

I have more ID's of course, but I think just with one it is enough to see my problem anyways.

This is the code I have so far:

proc sql;
create table want as
select a.*
, min(c.description) as description1
, max(c.description) as description2
from snippet1 a
inner join
        have1 c
on a.id= c.id
and a.calldate<= c.event_date
and c.event_date <= a.calldate+ 2
Group by 1;
Quit;

But this is the outcome:

data snippet1;
      input ID callDate :ddmmyy. start_date :ddmmyy. end_date :ddmmyy. cured description1 description2;
      format Date start_date end_date  ddmmyy10.;
    datalines4;
001 01/12/2020  28/11/2020  01/12/2020  Cured   New Plan    Stop Category
001 30/11/2020  28/11/2020  01/12/2020  Cured   New Plan    Stop Category
001 30/11/2020  28/11/2020  01/12/2020  Cured   New Plan    Stop Category
001 30/11/2020  28/11/2020  01/12/2020  Cured   New Plan    Stop Category
001 01/12/2020  28/11/2020  01/12/2020  Cured   New Plan    Stop Category
001 01/12/2020  28/11/2020  01/12/2020  Cured   New Plan    Stop Category
001 30/11/2020  28/11/2020  01/12/2020  Cured   New Plan    Stop Category
001 30/11/2020  28/11/2020  01/12/2020  Cured   New Plan    Stop Category
001 30/12/2020  28/12/2020  04/01/2021  Not Cured   New Plan    Stop Category
001 01/12/2020  28/11/2020  01/12/2020  Cured   New Plan    Stop Category
001 30/11/2020  28/11/2020  01/12/2020  Cured   New Plan    Stop Category
001 30/11/2020  28/11/2020  01/12/2020  Cured   New Plan    Stop Category
001 01/12/2020  28/11/2020  01/12/2020  Cured   New Plan    Stop Category
001 30/12/2020  28/12/2020  04/01/2021  Not Cured   New Plan    Stop Category
001 01/12/2020  28/11/2020  01/12/2020  Cured   New Plan    Stop Category
001 01/12/2020  28/11/2020  01/12/2020  Cured   New Plan    Stop Category
001 01/12/2020  28/11/2020  01/12/2020  Cured   New Plan    Stop Category
001 30/11/2020  28/11/2020  01/12/2020  Cured   New Plan    Stop Category
001 31/12/2020  28/12/2020  04/01/2021  Not Cured   New Plan    Stop Category
001 31/12/2020  28/12/2020  04/01/2021  Not Cured   New Plan    Stop Category
001 01/02/2021  28/01/2021  01/02/2021  Cured   New Plan    Stop Category
001 01/02/2021  28/01/2021  01/02/2021  Cured   New Plan    Stop Category
001 01/02/2021  28/01/2021  01/02/2021  Cured   New Plan    Stop Category
001 01/12/2020  28/11/2020  01/12/2020  Cured   New Plan    Stop Category
001 31/12/2020  28/12/2020  04/01/2021  Not Cured   New Plan    Stop Category
001 31/12/2020  28/12/2020  04/01/2021  Not Cured   New Plan    Stop Category
001 30/12/2020  28/12/2020  04/01/2021  Not Cured   New Plan    Stop Category
001 30/12/2020  28/12/2020  04/01/2021  Not Cured   New Plan    Stop Category
001 01/02/2021  28/01/2021  01/02/2021  Cured   New Plan    Stop Category
001 01/02/2021  28/01/2021  01/02/2021  Cured   New Plan    Stop Category
001 30/12/2020  28/12/2020  04/01/2021  Not Cured   New Plan    Stop Category
;;;;

As you can see, the dates repeat themselves a couple of times, and I'm not even sure all call dates are even properly in there.

Does anyone have any idea?

Upvotes: 0

Answers (5)

whymath

Reputation: 1394

This is so called table lookup problem. I recommend you the double set skill. It is very easy to learn.
Incidentally, I have fixed several mistakes in your data input step.

data snippet1;
  input ID$ callDate :ddmmyy. start_date :ddmmyy. end_date :ddmmyy. cured$13. ;
  format callDate start_date end_date  ddmmyy10.;
datalines4;
001 30/11/2020  28/11/2020  01/12/2020  Cured
001 01/12/2020  28/11/2020  01/12/2020  Cured
001 30/12/2020  28/12/2020  04/01/2021  Not Cured
001 31/12/2020  28/12/2020  04/01/2021  Not Cured
001 01/02/2021  28/01/2021  01/02/2021  Cured
;;;;
run;

data have1;
  input ID$  event_date :date9. description $42. ;
  format  event_date ddmmyy10.;
datalines4;
001 28Oct2020   
001 29Nov2020   
001 29Nov2020   New Plan
001 30Nov2020   
001 01Dec2020   
001 01Dec2020   New Plan
001 01Dec2020   Stop Category
001 01Dec2020   Review Date
001 02Dec2020   
001 02Dec2020   OLd Contact Strategy Level
001 02Dec2020   
001 04Dec2020   Stop Category
001 04Dec2020   Review Date
001 29Dec2020   
001 29Dec2020   New Plan
001 30Dec2020   
001 31Dec2020   
001 01Jan2021   
001 01Jan2021   
001 02Jan2021   
001 04Jan2021   
001 05Jan2021   OLd Contact Strategy Level
001 05Jan2021   
001 29Jan2021   
001 29Jan2021   New Plan
001 30Jan2021   
001 31Jan2021   
001 01Feb2021   
001 01Feb2021   
001 02Feb2021   
001 02Feb2021   OLd Contact Strategy Level
001 02Feb2021
;;;;
run;

data want1;
  length description1 description2 $42.;
  set snippet1;

  do i = 1 to rec;
    set have1(rename=ID=TmpID)nobs=rec point=i;
    if ID=TmpID and callDate <= event_date <= callDate + 2 then do;
      if description1 = '' then description1 = description;
      if description ^= '' then description2 = description;
    end;
  end;
  drop Tmp:;
run;

There is an excellent article about double set that you may want to read:
Multiple Set Statements in a Data Step: A Powerful Technique for Combining and Aggregating Complex Data

Upvotes: 2

Tom

Reputation: 51566

Do it in two steps. First find the descriptions that fall into the date ranges and sort them by event_date.

proc sql ;
create table list as 
  select a.id,a.calldate,a.start_date,a.end_date,a.cured
       , b.event_date,b.description
  from snippet1 a left join have1 b
  on a.id=b.id 
    and a.calldate<= b.event_date
    and b.event_date <= a.calldate+ 2
    and b.description is not null
  order by a.id,a.calldate,a.start_date,a.end_date,a.cured,b.event_date 
;
quit;

Then process the list to reduce to first and last.

data want;
  set list;
  by id calldate start_date end_date cured ;
  length description1 description2 $42  ;
  if first.cured then do;
     event_date1=event_date;
     description1=description;
  end;
  retain event_date1 description1;
  if last.cured then do;
    if not first.cured then do;
      description2=description;
      event_date2=event_date;
    end;
    output;
  end;
  drop description event_date;
  format event_date1 event_date2 yymmdd10.;
run;

Results:

Upvotes: 1

Joe

Reputation: 63424

I've taken @whymath's solution and worked it to be a bit faster; this is probably only necessary if you have GBs of data or need very high performance, though.

First, we construct a dataset that stores the first row number for each event_date in have1. We'll use that in the second data step to guide our hand when we retrieve rows from have1 so we do not need to iterate needlessly. We also create an index here to enable keyed set in the next step.

Second, we use that with the key option to retrieve that starting row, and then use that in the point loop instead of 1. We also add leave here to allow us to stop iterating when we're past the mark.

These all assume the dataset is in the order you want - but I think we have to assume that or your whole idea goes awry. Do make sure though that it's properly in order, or you will have issues.

data have1_ids(index=(id_calldate=(id calldate)));
  set have1;
  rename event_Date=calldate;
  by id event_date;
  _row+1;   *This keeps track of the row number only;
  if first.event_date;
  keep id event_date   _row;
run;


data want1;
  length description1 description2 $42.;
  set snippet1;
  set have1_ids key=id_calldate;
  do i = _row to rec;                                            *now we can start on the right row;
    set have1(rename=ID=_ID) nobs=rec point=i;              
    if (event_date gt calldate+2) or (ID ne _ID) then leave;     *conditions to exit the loop - if either of these is true then we are done here;
    if missing(description1) then description1 = description;    *populate the earlier description once we hit a valid description;
    if not missing(description) then description2 = description; *keep rewriting this until the end;
  end;
  drop _:;
run;

Note I do not check for if _IORC_ eq 0 here like I do in the other keyed set answer - that's because I don't care much if it fails; if it doesn't find a matching row, then the prior value for row is okay to use. It's not optimal, but it's close - and there's no good way to get the next row.

Upvotes: 1

Joe

Reputation: 63424

Here's one more way to do it, which might be the most performant in some cases where the have1 is relatively large. It's a bit less flexible though in some ways.

This uses the keyed set to do all of the work. It takes have1, and makes three copies of each row - one for each date you want it to qualify for. Then, keyed set simply grabs the rows that are on the right date. Keyed set uses an index on the set dataset to find matching rows by the index.

data have1_expanded(index=(id_calldate=(id calldate)));
  set have1;
  if not missing(description);
  format calldate date9.;
  do calldate=event_date to event_date-2 by -1;
    output;
  end;
run;

data want1;
  set snippet1;
  do _n_ = 1 by 1 until (_IORC_ ne 0);  *technically pointless but I always include it to make sure I do not forget _IORC_;
    set have1_expanded key=id_calldate end=eof;
    if _IORC_ ne 0 then leave;          *as keyed set iterates, _IORC_ will be zero when it finds a match and nonzero when it does not find any more matches;
    if _n_ eq 1 then description1=description;  *first time through, grab that first description;
    description2=description;                   *every time through, overwrite this to get the last description;
  end;
run;

Upvotes: 1

Joe

Reputation: 63424

You can't get what you want with pure SQL, or at least not with some modifications to your data. SQL doesn't respect "order", and particularly not SAS's implementation of SQL (which doesn't allow for mid-query ordering). So your request to get the "last" row doesn't fly: all rows with calldate+2 are equivalent as far as SQL is concerned, and you could get any of them effectively at random if you successfully asked for that. (It wouldn't actually be random, but you should treat it as if it were for the purposes of writing code - only do that if you really don't care which you get.)

To do this in SQL, you must add in an ordering field. That's technically possible "in line" but not recommended (using monotonic()) as it is an undocumented function. Better is to add it in with a data step view.

First:

data have1_v/view=have1_v;
    set have1;
    by id;
    if first.id then id_row=0;
    id_row+1;
run;

This establishes the order. Then:

proc sql;
  select snippet1.*, 
    
    ( select description from have1_v where have1_v.id=snippet1.id
                                        and have1_v.event_Date between snippet1.calldate and snippet1.calldate+2
                                        and have1_v.description is not null
                                    having have1_v.id_row = min(have1_v.id_row)
    )as min_descript,
    ( select description from have1_v where have1_v.id=snippet1.id
                                        and have1_v.event_Date between snippet1.calldate and snippet1.calldate+2
                                        and have1_v.description is not null
                                    having have1_v.id_row = max(have1_v.id_row)
    )as max_descript
    from snippet1;
quit;

This grabs your 'min' and 'max' description. I think this returns what you ask for, though there is no row in snippet1 that matches the 12/2 date you mention in the question.

All this said, SAS has better tools for things like this where order matters. The SAS data step does have a solution for example: it guarantees order, assuming you don't muck with it yourself. See for example how you could create a summarized dataset:

data have1_summarized;
  set have1;
  by id event_date;
  where not missing(description);
  retain min_description max_description;
  if first.event_date then min_description = description;
  max_description=description;
  if last.event_date then output;
run;

Now you can use SQL or other tools to combine this back with the snippet1 dataset, as you no longer have duplicate event dates, so ordering no longer matters.

Upvotes: 2

proc SQL - join duplicating values

Answers (5)

Related Questions