Mike L
Mike L

Reputation: 496

Need to compare 2 variables, each coming from a separate data set, and flag differences

I have 2 SAS data sets, Lab and Rslt. They both have all of the same variables, but Rslt is supposed to have what is essentially a subset of Lab. For what I'm trying to do, there are 4 important variables: visit, accsnnum, battrnam, and lbtestcd. All are character variables. I want to compare the two files Lab and Rslt to find out where they vary -- specifically, I need to know the count of lbtestcd per unique accsnnum.

But I must control for a few factors. First, I only need to compare observations that have "Lipid Panel" or "Chemistry (6)" in the battrnam variable. The Rslt file only contains these observations, so we don't need to worry about that one. So I subsetted Lab using this code:

data work.lab;
  set  livingston.ndb_lab_1;
  where battrnam contains "Lipid Panel" or battrnam = "Chemisty (6)";
run; 

This worked fine. Now, I need to control for the variable visit. I need to get rid of all observations in both Lab and Rslt that have visits that contain "Day 1" or "Screening". I accomplished this using the following code:

data work.lab;
set work.lab;
if visit = "Day 1" or visit = "Screening" then delete;
else visit = visit;
run;

data work.rslt;
set work.rslt;
if visit = "Day 1" or visit = "Screening" then delete;
else visit = visit;
run; 

Now this is where I get stuck. I need to create a way to compare the count of lbtestcd by accsnnum between the two separate files Lab and Rslt, and I need a way for it to flag the accsnum where there is a difference between Lab and Rslt for the count of lbtestcd. For example, if Lab has an accsnum A1 that has 5 unique lbtestcd values, and Rslt has the accsnum A1 with 7 unique lbtestcd value, I need that one to be brought to my attention.

I can do a proc freq for each file, but these are large data sets and I don't want to have to compare by hand. Perhaps exporting the count of lbtestcd by accsnum to a variable in a new 3rd dataset for each of the 2 files Lab and Rslt, then creating a variable that is the difference of these two? So that if difference != 0 then I can get a report of those asscnum? Advice in SQL will work too, as I can run that through SAS.

Edit I've used some SQL to get the count of lbtestcd by accsnum for each data set using the code below, though I still need to figure out how to export these values to a data set to compare.

proc sql;
select accsnnum, count(lbtestcd)
from work.lab1
group by accsnnum;
quit;

proc sql;
select accsnnum, count(lbtestcd)
from work.rslt1
group by accsnnum;
quit;

Thanks for any and all help you can give. This one is really stumping me!

Upvotes: 1

Views: 1115

Answers (1)

Joe
Joe

Reputation: 63434

I would do a PROC FREQ on each dataset (or proc whatever-you-like-that-does-counts) and then use PROC COMPARE. For example:

proc freq data=rslt1;
tables accsnnum*ibtestcd/out=rsltcounts;
run;

proc freq data=lab1;
tables accsnnum*ibtestcd/out=labcounts;
run;

proc compare base=lab1 compare=rslt1 out=compares /* options */;
by accsnnum;
run;

PROC COMPARE has a lot of options; in this case the most helpful would probably be:

  • outnoequal - only outputs rows for each row that are not identical in the two datasets
  • outbase and outcomp - outputs a row for each of BASE and COMPARE datasets (if OUTNOEQUAL, then only when they differ)
  • outdif - outputs 'difference' rows, ie, one minus the other; this may or may not be helpful for you

The documentation lists all of the options. You may also need to look at the METHOD options if your data might have numeric precision issues.

Upvotes: 2

Related Questions