Ezio_Auditore
Ezio_Auditore

Reputation: 43

Checking if the date lie between two dates from a different column based on a unique id (account_number)

I have a dataset which has an account number a billing cycle date and transaction date . Every account should have first two occurrences for the billing cycle date and then the transaction date has to checked if it lies between those two billing cycle dates.

Data that I have .

 input acct_num   biiling_date tran_date ;
    DATALINES; 
    1111      23FEB2020  27FEB2020
    1111      23MAR2020  27FEB2020
    2222      17MAR2020  22MAR2020
    2222      27APR2020  22MAR2020
    2222      27MAY2020  22MAR2020
    3333      23JUL2018  02JUN2022
    3333      23AUG2018  02JUN2022

Data that I want.

    1111      23FEB2020  27FEB2020
    1111      23MAR2020  27FEB2020
    2222      17MAR2020  22MAR2020
    2222      27APR2020  22MAR2020

If you see in the final dataset, third occurrence of account 2222 has been removed and entire rows of account 3333 has been removed because no tran_date was between the first and second billing cycle date for that account.

Thanks.

Upvotes: 0

Views: 161

Answers (1)

Stu Sztukowski
Stu Sztukowski

Reputation: 12909

The main problem here is that you want to look ahead one row for each group, then make the decision to keep or drop both rows simultaneously. We can make this easier by transposing the data into a long format then merging it back with the original data:

proc transpose data=have out=have_tpose(keep=acct_num COL1-COL2);
    by acct_num;
    var billing_date;
run;
acct_num    COL1        COL2
1111        23FEB2020   23MAR2020
2222        17MAR2020   27APR2020
3333        23JUL2018   23AUG2018

All we care about are the first two rows. COL1 and COL2 represent those first two rows.

Now we can merge this back to our original dataset and check if the transaction date is between COL1 and COL2 for every acct_num. Since we only need to keep the first two rows of data, we're going to count the number of rows, n, and keep track of it for every group. At the start of each group we'll reset the counter.

We will only output if:

  1. n is < 3
  2. tran_date is between COL1 and COl2
data want;
    merge have
          have_tpose
    ;
    by acct_num;

    if(first.acct_num) then n = 0;

    n+1;
    
    if(n < 3 AND COL1 < tran_date < COL2);

    drop n COL1-COL2;
run;
acct_num    billing_date    tran_date
1111        23FEB2020       27FEB2020
1111        23MAR2020       27FEB2020
2222        17MAR2020       22MAR2020
2222        27APR2020       22MAR2020

To do this with lags, you'll need to still do this in two steps: first to create a list of values to drop, and second to create the final filtered list.

data drop_list;
    set have;
    by acct_num;

    lag_billing_date = lag(billing_date);

    if(first.acct_num) then do;
        n = 0;
        call missing(lag_billing_date);
    end;

    n+1;

    if(n = 2) AND NOT (lag_billing_date LE tran_date LE billing_date) 
        then drop_flag = 1;

    if(drop_flag);

    keep acct_num;
run;

data want;
    merge have
          drop_list(in=droplist)
    ;
    by acct_num;

    if(first.acct_num) then n = 0;

    n+1;

    if(n < 3 AND NOT droplist);
run;

If you want a really crazy way of doing this in one step with hash tables, here's that method:

data want;
    set have;
    by acct_num;
    retain drop_flag;

    /* Load the dataset into memory so we can search ahead with hash methods */
    if(_N_ = 1) then do;
        format billing_date2
               billing_date_start
               billing_date_end   date9.
        ;

        dcl hash h(dataset: 'have(rename=(billing_date = billing_date2))', multidata: 'yes');
            h.defineKey('acct_num');
            h.defineData('billing_date2');
        h.defineDone();

        call missing(billing_date2);
    end;
    
    /* For the first value of each account, get the first billing date and next billing date */
    if(first.acct_num) then do;
        n = 0;
        drop_flag = 0;

        /* Get the first value of billing date for the current acct_num */
        rc = h.Find();      
        billing_date_start = billing_date2;

        /* Get the next value of billing date for the current acct_num */
        rc = h.Find_Next(); 
        billing_date_end   = billing_date2;

        /* Set a drop flag if the tran date is not between the first and next value */
        drop_flag = (NOT (billing_date_start LE tran_date LE billing_date_end) );
    end;

    /* Count each row */
    n+1;

    if(n < 3 AND drop_flag = 0);

    keep acct_num billing_date tran_date;
run;

Upvotes: 0

Related Questions