DukeLuke
DukeLuke

Reputation: 315

SAS First Dot Variable

I'm trying to use the First Dot automatic variable in a SAS datastep, but it's not producing the output I'm looking for. I think that I may not be understanding how SAS is grouping my variables with my by statement

Here's my code:

data testing;
input SSN ACCT_num YYMM ID $;
datalines;
123456789   987654321   1801    1
123456789   987654321   1711    9
123456789   476543218   1801    2
123456789   476543218   1711    10
;
run;

proc sort data=testing out=tst;
by ssn acct_num id descending yymm ;
run;

data final;
set tst;
by ssn acct_num id;
if first.id;
run;

proc print data=testing;
proc print data=tst;
proc print data=final;
quit;

Here's the output:

Obs SSN ACCT_num YYMM ID 
1 123456789 476543218 1711 10 
2 123456789 476543218 1801 2 
3 123456789 987654321 1801 1 
4 123456789 987654321 1711 9 

Here's my desired output:

Obs SSN ACCT_num YYMM ID 
1 123456789 476543218 1801 2 
2 123456789 987654321 1801 1 

I'm looking to group by an SSN and an Account Number, obtaining the latest ID variable where the latest would be the one with the highest YYMM value.

Upvotes: 0

Views: 643

Answers (2)

Kiran
Kiran

Reputation: 3315

one more to achieve this using proc sql;

proc sql;
  select * from testing
  group by ssn, acct_num
 having yymm= max(yymm);

Upvotes: 1

Len Greski
Len Greski

Reputation: 10855

Based on the output, it appears the desired output is to sort on SSN, Account Number, and descending YYMM, and select for first.ACCT_num, as follows.

data testing;
input SSN ACCT_num YYMM ID $;
datalines;
123456789   987654321   1801    1
123456789   987654321   1711    9
123456789   476543218   1801    2
123456789   476543218   1711    10
;
run;

proc sort data=testing out=tst;
by ssn acct_num descending yymm ;
run;

data final;
set tst;
by ssn acct_num;
if first.acct_num;
run;

proc print data=testing;
proc print data=tst;
proc print data=final;
quit;

...and the output:

enter image description here

Explanation: the BY statement in a SAS data step creates logical variables first. and last. for each variable in the BY statement.

When there are multiple BY variables, there is an implied order where the lower sort order items (acct_num, YYMM, and ID) vary in first. and last. more frequently than the higher sort order items. Given the code in the OP,ID varies most frequently.

In the raw data, the value of ID is unique across all observations. Therefore, for all four rows of the input data, first.ID and last.ID are TRUE. This explains the output from the original post.

Placing acct_num before descending YYMM in the sort order means that when first.acct_num is TRUE, the observation on the program data vector will have the highest value of YYMM for that account number.

Also note that if first.YYMM was used in the data step to select the data, it would have produced incorrect results because within a given acct_num, yymm is never duplicated across multiple observations, so first.YYMM is always TRUE.

One can demonstrate this by saving the first. and last. variables to the output data set and printing them as follows.

data final2;
   set tst;
   by ssn acct_num descending yymm;
   firstSSN = first.ssn;
   lastSSN = last.ssn;
   firstACCT = first.acct_num;
   lastACCT = last.acct_num;
   firstYYMM = first.yymm;
   lastYYMM = last.yymm;
run;
proc print data = final2;
  title "Final Data Set Illustrating First. and Last. Variables";
run;

...and the output:

enter image description here

Upvotes: 2

Related Questions