Reputation: 43
I have a dataset which has an account number a billing cycle date and transaction date . Every account should have first two occurrences for the billing cycle date and then the transaction date has to checked if it lies between those two billing cycle dates.
Data that I have .
input acct_num biiling_date tran_date ;
DATALINES;
1111 23FEB2020 27FEB2020
1111 23MAR2020 27FEB2020
2222 17MAR2020 22MAR2020
2222 27APR2020 22MAR2020
2222 27MAY2020 22MAR2020
3333 23JUL2018 02JUN2022
3333 23AUG2018 02JUN2022
Data that I want.
1111 23FEB2020 27FEB2020
1111 23MAR2020 27FEB2020
2222 17MAR2020 22MAR2020
2222 27APR2020 22MAR2020
If you see in the final dataset, third occurrence of account 2222 has been removed and entire rows of account 3333 has been removed because no tran_date was between the first and second billing cycle date for that account.
Thanks.
Upvotes: 0
Views: 161
Reputation: 12909
The main problem here is that you want to look ahead one row for each group, then make the decision to keep or drop both rows simultaneously. We can make this easier by transposing the data into a long format then merging it back with the original data:
proc transpose data=have out=have_tpose(keep=acct_num COL1-COL2);
by acct_num;
var billing_date;
run;
acct_num COL1 COL2
1111 23FEB2020 23MAR2020
2222 17MAR2020 27APR2020
3333 23JUL2018 23AUG2018
All we care about are the first two rows. COL1
and COL2
represent those first two rows.
Now we can merge this back to our original dataset and check if the transaction date is between COL1
and COL2
for every acct_num
. Since we only need to keep the first two rows of data, we're going to count the number of rows, n
, and keep track of it for every group. At the start of each group we'll reset the counter.
We will only output if:
n
is < 3tran_date
is between COL1
and COl2
data want;
merge have
have_tpose
;
by acct_num;
if(first.acct_num) then n = 0;
n+1;
if(n < 3 AND COL1 < tran_date < COL2);
drop n COL1-COL2;
run;
acct_num billing_date tran_date
1111 23FEB2020 27FEB2020
1111 23MAR2020 27FEB2020
2222 17MAR2020 22MAR2020
2222 27APR2020 22MAR2020
To do this with lags, you'll need to still do this in two steps: first to create a list of values to drop, and second to create the final filtered list.
data drop_list;
set have;
by acct_num;
lag_billing_date = lag(billing_date);
if(first.acct_num) then do;
n = 0;
call missing(lag_billing_date);
end;
n+1;
if(n = 2) AND NOT (lag_billing_date LE tran_date LE billing_date)
then drop_flag = 1;
if(drop_flag);
keep acct_num;
run;
data want;
merge have
drop_list(in=droplist)
;
by acct_num;
if(first.acct_num) then n = 0;
n+1;
if(n < 3 AND NOT droplist);
run;
If you want a really crazy way of doing this in one step with hash tables, here's that method:
data want;
set have;
by acct_num;
retain drop_flag;
/* Load the dataset into memory so we can search ahead with hash methods */
if(_N_ = 1) then do;
format billing_date2
billing_date_start
billing_date_end date9.
;
dcl hash h(dataset: 'have(rename=(billing_date = billing_date2))', multidata: 'yes');
h.defineKey('acct_num');
h.defineData('billing_date2');
h.defineDone();
call missing(billing_date2);
end;
/* For the first value of each account, get the first billing date and next billing date */
if(first.acct_num) then do;
n = 0;
drop_flag = 0;
/* Get the first value of billing date for the current acct_num */
rc = h.Find();
billing_date_start = billing_date2;
/* Get the next value of billing date for the current acct_num */
rc = h.Find_Next();
billing_date_end = billing_date2;
/* Set a drop flag if the tran date is not between the first and next value */
drop_flag = (NOT (billing_date_start LE tran_date LE billing_date_end) );
end;
/* Count each row */
n+1;
if(n < 3 AND drop_flag = 0);
keep acct_num billing_date tran_date;
run;
Upvotes: 0