Reputation: 315
I'm trying to use the First Dot automatic variable in a SAS datastep, but it's not producing the output I'm looking for. I think that I may not be understanding how SAS is grouping my variables with my by statement
Here's my code:
data testing;
input SSN ACCT_num YYMM ID $;
datalines;
123456789 987654321 1801 1
123456789 987654321 1711 9
123456789 476543218 1801 2
123456789 476543218 1711 10
;
run;
proc sort data=testing out=tst;
by ssn acct_num id descending yymm ;
run;
data final;
set tst;
by ssn acct_num id;
if first.id;
run;
proc print data=testing;
proc print data=tst;
proc print data=final;
quit;
Here's the output:
Obs SSN ACCT_num YYMM ID
1 123456789 476543218 1711 10
2 123456789 476543218 1801 2
3 123456789 987654321 1801 1
4 123456789 987654321 1711 9
Here's my desired output:
Obs SSN ACCT_num YYMM ID
1 123456789 476543218 1801 2
2 123456789 987654321 1801 1
I'm looking to group by an SSN and an Account Number, obtaining the latest ID variable where the latest would be the one with the highest YYMM value.
Upvotes: 0
Views: 643
Reputation: 3315
one more to achieve this using proc sql;
proc sql; select * from testing group by ssn, acct_num having yymm= max(yymm);
Upvotes: 1
Reputation: 10855
Based on the output, it appears the desired output is to sort on SSN, Account Number, and descending YYMM, and select for first.ACCT_num, as follows.
data testing;
input SSN ACCT_num YYMM ID $;
datalines;
123456789 987654321 1801 1
123456789 987654321 1711 9
123456789 476543218 1801 2
123456789 476543218 1711 10
;
run;
proc sort data=testing out=tst;
by ssn acct_num descending yymm ;
run;
data final;
set tst;
by ssn acct_num;
if first.acct_num;
run;
proc print data=testing;
proc print data=tst;
proc print data=final;
quit;
...and the output:
Explanation: the BY
statement in a SAS data step creates logical variables first.
and last.
for each variable in the BY
statement.
When there are multiple BY
variables, there is an implied order where the lower sort order items (acct_num
, YYMM
, and ID
) vary in first.
and last.
more frequently than the higher sort order items. Given the code in the OP,ID
varies most frequently.
In the raw data, the value of ID
is unique across all observations. Therefore, for all four rows of the input data, first.ID
and last.ID
are TRUE. This explains the output from the original post.
Placing acct_num
before descending YYMM
in the sort order means that when first.acct_num
is TRUE, the observation on the program data vector will have the highest value of YYMM
for that account number.
Also note that if first.YYMM
was used in the data step to select the data, it would have produced incorrect results because within a given acct_num
, yymm
is never duplicated across multiple observations, so first.YYMM
is always TRUE.
One can demonstrate this by saving the first.
and last.
variables to the output data set and printing them as follows.
data final2;
set tst;
by ssn acct_num descending yymm;
firstSSN = first.ssn;
lastSSN = last.ssn;
firstACCT = first.acct_num;
lastACCT = last.acct_num;
firstYYMM = first.yymm;
lastYYMM = last.yymm;
run;
proc print data = final2;
title "Final Data Set Illustrating First. and Last. Variables";
run;
...and the output:
Upvotes: 2