user9016406
user9016406

Reputation:

Match Merge SAS Data Set Differences

I have a question in SAS. How do you normally determine a difference in two data sets? Below is the code that I have used. This is what I have used to determine if records are in one data set but not the other. For example, if someone running the code got higher or lower record counts. I am looking for any alternatives. Thanks.

 PROC SORT DATA=IVT.REQ_1_3_17  OUT=lob.REQ_1_3_17i;
     BY ACCT_NB DAY_TIME_DIM_NB TXN_AM  TXN_POSTING_SEQ CS_TXN_CD REQ_1_3_EXCL;
 RUN;

PROC SORT DATA=lob.REQ_1_3_17  OUT=lob.REQ_1_3_17l;
  BY ACCT_NB DAY_TIME_DIM_NB TXN_AM TXN_POSTING_SEQ CS_TXN_CD REQ_1_3_EXCL;
RUN;

DATA lob.MISMATCHES_1;
MERGE lob.REQ_1_3_17i(IN =A) lob.REQ_1_3_17l(IN=B);
BY ACCT_NB  DAY_TIME_DIM_NB TXN_AM TXN_POSTING_SEQ  CS_TXN_CD REQ_1_3_EXCL;
IF A AND NOT B;
RUN; 

Upvotes: 2

Views: 900

Answers (1)

Dmitry.Kolosov
Dmitry.Kolosov

Reputation: 635

There is a bunch of ways to find differences in two datasets. A standard one is PROC COMPARE, which has an OUT option to save all differences in a dataset. You can use special options to output only the different values:

proc compare    
    base=lob.REQ_1_3_17i
    comp=lob.REQ_1_3_17l
    out=differences
    outDif outBase outComp outNoEqual
    ;
run;

There is an automatic macro variable generated by the procedure, called SYSINFO. With it you can understand a general type of differences without looking into the OUT dataset. See SAS help for more details on it.

In the differences dataset you can find detailed information about the comparison. "_TYPE_" variable shows the type of difference. When its value is DIF, there is a difference in values. If a numeric variable is different, it will simply contain difference between two values. If character variable is differet, you will get something like ".X..X...", which means that two values have difference in the second and fifth characters.

Well, it all sounds nice, until your data has different number of records or different ID variables. For that you should use ID statement, so that values are grouped before comparison.

proc compare    
   .....;
   id acct_nb;
run;

Still in real life I do not use proc compare that often, unless it is a final comparison. I usually do not need to know that level of details and just need to select some ID variables which have different values. Merge statement is one way to solve it, but I prefer using SQL. Idea is to unite two datasets and then create a unique 'ID' variable for each of the records using catx and then search for it in each of the datasets.

proc SQL;
    create table differences as
    select distinct idVar1, idVar2, idVar3
    from data1 natural full join data2
    where 
    catx('@',idVar1, idVar2, ...., var1, var2, ...)
        not in (select catx('@',idVar1, idVar2, ...., var1, var2, ...) from data1)
    or
    catx('@',idVar1, idVar2, ...., var1, var2, ...)
        not in (select catx('@',idVar1, idVar2, ...., var1, var2, ...) from data2)
    ;
quit;

Note: This methods does not catch full duplicates, so if your data can have them and it is important to know about, use other approaches. You also need to keep in mind that the maximum string value is 32767 and replace '@' with something else (e.g., '$@`') if one of your variables can contain only that character.

Upvotes: 1

Related Questions