Reputation: 43
I have two datasets , I have to check if the values are matching across the rows, if they are matching the flag should return a value as pass else fail.
datasets that I have :
Dataset 1
Name Address Age
1 John Carry 66
2 Andy NYC 32
3. Sam LA 42
Dataset 2
Initial Address Age
1 John Carry 56
2 Andy NV 32
3. Sam LA 42
Dataset that I need
Dataset Final
Name Address Age Flag
1 John Carry 66 Fail
2 Andy NYC 32 Fail
3. Sam LA 42 Pass
The flag should be appended to dataset 1 , also the variables names can be different in both the dataset in some cases (like Name vs Initial) but the values should match for the flag to be pass. I have used Case with join but having some trouble . Any help would be appreciated.
Upvotes: 1
Views: 71
Reputation: 12849
You can use a combination of proc compare
and proc sql
to achieve this. First, use proc compare
to create a dataset that only has rows with differences using name
as the ID. Note that we're assuming names all match up uniquely in this order, otherwise you will need to do a different method described below.
proc compare
base = dataset1
compare = dataset2(rename=(initial=name))
out = compare(keep=name)
noprint
outnoequal;
id name;
run;
This creates a one-column dataset of all names that have at least one difference in columns.
Name
John
Andy
Now you can join this back onto the original dataset using any method you want, whether it's a sort-merge, hash table, or SQL. To keep it simple, we'll use SQL.
proc sql;
create table want as
select t1.*
, CASE
when(t1.name = t2.name) then 'Fail'
else 'Pass'
END as flag
from dataset1 as t1
LEFT JOIN
compare as t2
ON t1.name = t2.name
;
quit;
Name Address Age flag
Andy NYC 32 Fail
John Carry 66 Fail
Sam LA 42 Pass
Let's say there are duplicate names in the datasets and you can't use it as an ID. Instead, create an index variable in each dataset and use that as the ID instead. For example:
data dataset1;
set dataset1;
id = _N_;
run;
data dataset2;
set dataset2;
id = _N_;
run;
Run proc compare
the same way as above, but replace name
with id
for the ID variable, then join on t1.id = t2.id
.
Upvotes: 0
Reputation: 4937
Here is an approach that is independent of your variable names, but is dependent on the internal order of variables being the same across the two data sets.
data data1;
input Name $ 1-4 Address $ 6-10 Age;
datalines;
John Carry 66
Andy NYC 32
Sam LA 42
data data2;
input Initial $ 1-4 Address $ 6-10 Age;
datalines;
John Carry 56
Andy NV 32
Sam LA 42
;
data temp2;
set data2;
length string $200;
string = cats(of _ALL_);
run;
data data1_new(drop = string);
if _N_ = 1 then do;
dcl hash h(dataset : 'temp2');
h.definekey('string');
h.definedone();
end;
set data1;
length string $200;
string = cats(of _ALL_);
flag = ifc(h.check(), 'Fail', 'Pass');
run;
Result:
Name Address Age flag
John Carry 66 Fail
Andy NYC 32 Fail
Sam LA 42 Pass
Upvotes: 0