NewUsr
NewUsr

Reputation: 69

Search dates between ranges of periods and count occurrences

suppose to have the following:

data have;
input ID :$20. Start :date9. End :date9.;
format start end ddmmyy9.;
cards;
0001 01JAN2015 30JUN2015 
0001 01JUL2015 01FEB2016 
0001 02FEB2016 11DEC2016 
0001 12DEC2016 06FEB2017 
0001 07FEB2017 31DEC2017 
0002 01JAN2016 31DEC2017 
0002 01JAN2018 01MAR2018
0002 01APR2018 31NOV2018
......................
;          

and a list of dates:

data dates;            
input dates :$20.;              
format dates ddmmyy9.;    
cards; 
01JAN2015 
31DEC2015 
01JAN2016 
31DEC2016 
01JAN2017 
31DEC2017 
01JAN2018 
31DEC2018 
;   

Is there a way to know if, for each ID, each date is in the range? For example: the ID 0001 contains all dates except 01JAN2018 and 31DEC2018. Moreover, for each year I need to count how many IDs start at 01/01 and end at 31/12 so they appear for the entire year. For example, ID 0002 will not be counted for 2018 because it ends before 31/12. Desired output:

ID       01JAN2015  31DEC2015  01JAN2016 31DEC2016 01JAN2017  31DEC2017  01JAN2018   31DEC2018
0001        yes        yes         yes       yes       yes      yes         no         no       
0002         no         no         yes       yes       yes      yes         yes        no       

Final table:

 Year       Count      
 2015         1         
 2016         2
 2017         2       
 2018         0              

To match the dates in the range I tried:

   proc sql;
   create table want as;
   select dates as t1;
   join have as t2;       
   t2.dates between t1.start and t1.end
   order by 1,2;        
   quit;      

Unfortunately I lose the ID correspondence.

Can anyone help me please?

Thank you in advance

Upvotes: 0

Views: 363

Answers (1)

shaun_m
shaun_m

Reputation: 2776

The first output can be achieved using proc sql followed by proc transpose and then a data step.
(The variables are created in a different order than in your desired output - hopefully that is not a problem for you.)

The second can be done with a data step followed by proc summary.

NB - I changed the invalid date 31NOV2018 to 30NOV2018 before running my code.

Output 1

* First, join the HAVE and DATES data sets to check which dates fall within a range;
proc sql;
    create table want1 as
        select h.ID, d.dates format=date9., 'yes' as flag
            from dates d left join have h on (d.dates between h.start and h.end)
            order by ID, dates      
            ;
run;

* transpose this to the wider format required. Cells where there is no match will be blank;
proc transpose data=want1 out=want1 (where=(ID ne '') drop=_name_); 
    by ID;
    id dates;
    var flag;
run;

* Now populate the blank cells with "no";
data want1;
    set want1;
    array datevars (*) _all_;
    do i = 1 to dim(datevars);
        if datevars(i) = '' then datevars(i)='no';
    end;
    drop i;
run;

Output 2


* This assumes the data are in order by ID, start and end - if not then sort the data before this step;
* First read throught the HAVE data set and set count = 1 for each year that is fully covered by each ID;
data want2;
    set have;
    by ID;
    retain first last count; * FIRST and LAST are the earliest START and latest END respectively for each ID;
    if first.ID then do;
        first=start;
        count=0;
    end;
    if last.ID then do;
        last=end;
        do yr=year(first) to year(last); * for each ID we want to cycle through all the years covered;
            if first <= mdy(1,1,yr) and last >= mdy(12,31,yr) then count = 1; * sets count=1 if the year is fully contained within the range;
            output; 
            count=0;    
        end;            
    end;
    drop start end first last;
    format first last date9.;
run;

* The step above produces a row for each combination of year * ID, we just want totals by year which is done by PROC SUMMARY;
proc summary data=want2 nway;
    class yr;
    var count;
    output out=want2 (drop=_type_ _freq_) sum=;
run;

Upvotes: 2

Related Questions