Reputation:
I have a question in SAS. How do you normally determine a difference in two data sets? Below is the code that I have used. This is what I have used to determine if records are in one data set but not the other. For example, if someone running the code got higher or lower record counts. I am looking for any alternatives. Thanks.
PROC SORT DATA=IVT.REQ_1_3_17 OUT=lob.REQ_1_3_17i;
BY ACCT_NB DAY_TIME_DIM_NB TXN_AM TXN_POSTING_SEQ CS_TXN_CD REQ_1_3_EXCL;
RUN;
PROC SORT DATA=lob.REQ_1_3_17 OUT=lob.REQ_1_3_17l;
BY ACCT_NB DAY_TIME_DIM_NB TXN_AM TXN_POSTING_SEQ CS_TXN_CD REQ_1_3_EXCL;
RUN;
DATA lob.MISMATCHES_1;
MERGE lob.REQ_1_3_17i(IN =A) lob.REQ_1_3_17l(IN=B);
BY ACCT_NB DAY_TIME_DIM_NB TXN_AM TXN_POSTING_SEQ CS_TXN_CD REQ_1_3_EXCL;
IF A AND NOT B;
RUN;
Upvotes: 2
Views: 900
Reputation: 635
There is a bunch of ways to find differences in two datasets. A standard one is PROC COMPARE, which has an OUT option to save all differences in a dataset. You can use special options to output only the different values:
proc compare
base=lob.REQ_1_3_17i
comp=lob.REQ_1_3_17l
out=differences
outDif outBase outComp outNoEqual
;
run;
There is an automatic macro variable generated by the procedure, called SYSINFO. With it you can understand a general type of differences without looking into the OUT dataset. See SAS help for more details on it.
In the differences dataset you can find detailed information about the comparison. "_TYPE_" variable shows the type of difference. When its value is DIF, there is a difference in values. If a numeric variable is different, it will simply contain difference between two values. If character variable is differet, you will get something like ".X..X...", which means that two values have difference in the second and fifth characters.
Well, it all sounds nice, until your data has different number of records or different ID variables. For that you should use ID statement, so that values are grouped before comparison.
proc compare
.....;
id acct_nb;
run;
Still in real life I do not use proc compare that often, unless it is a final comparison. I usually do not need to know that level of details and just need to select some ID variables which have different values. Merge statement is one way to solve it, but I prefer using SQL. Idea is to unite two datasets and then create a unique 'ID' variable for each of the records using catx and then search for it in each of the datasets.
proc SQL;
create table differences as
select distinct idVar1, idVar2, idVar3
from data1 natural full join data2
where
catx('@',idVar1, idVar2, ...., var1, var2, ...)
not in (select catx('@',idVar1, idVar2, ...., var1, var2, ...) from data1)
or
catx('@',idVar1, idVar2, ...., var1, var2, ...)
not in (select catx('@',idVar1, idVar2, ...., var1, var2, ...) from data2)
;
quit;
Note: This methods does not catch full duplicates, so if your data can have them and it is important to know about, use other approaches. You also need to keep in mind that the maximum string value is 32767 and replace '@' with something else (e.g., '$@`') if one of your variables can contain only that character.
Upvotes: 1