Ezio_Auditore
Ezio_Auditore

Reputation: 43

Creating flag based on matching values in the variable across two datasets

I have two datasets , I have to check if the values are matching across the rows, if they are matching the flag should return a value as pass else fail.

datasets that I have :

 Dataset 1                                
       Name    Address      Age   
1      John      Carry      66
2      Andy       NYC       32
3.     Sam        LA        42

 Dataset 2       
      Initial    Address      Age
1      John      Carry        56
2      Andy       NV          32
3.     Sam        LA          42

Dataset that I need

 Dataset Final                                
       Name    Address      Age   Flag
1      John      Carry      66    Fail
2      Andy       NYC       32    Fail
3.     Sam        LA        42    Pass

The flag should be appended to dataset 1 , also the variables names can be different in both the dataset in some cases (like Name vs Initial) but the values should match for the flag to be pass. I have used Case with join but having some trouble . Any help would be appreciated.

Upvotes: 1

Views: 71

Answers (2)

Stu Sztukowski
Stu Sztukowski

Reputation: 12849

You can use a combination of proc compare and proc sql to achieve this. First, use proc compare to create a dataset that only has rows with differences using name as the ID. Note that we're assuming names all match up uniquely in this order, otherwise you will need to do a different method described below.

proc compare 
    base    = dataset1
    compare = dataset2(rename=(initial=name))
    out     = compare(keep=name)
    noprint
    outnoequal;
    id name;
run;

This creates a one-column dataset of all names that have at least one difference in columns.

Name
John
Andy

Now you can join this back onto the original dataset using any method you want, whether it's a sort-merge, hash table, or SQL. To keep it simple, we'll use SQL.

proc sql;
    create table want as
        select t1.*
             , CASE
                   when(t1.name = t2.name) then 'Fail'
                   else 'Pass'
               END as flag
        from dataset1 as t1
        LEFT JOIN
             compare as t2
        ON t1.name = t2.name
    ;
quit;
Name    Address Age flag
Andy    NYC     32  Fail
John    Carry   66  Fail
Sam     LA      42  Pass

Let's say there are duplicate names in the datasets and you can't use it as an ID. Instead, create an index variable in each dataset and use that as the ID instead. For example:

data dataset1;
    set dataset1;
    id = _N_;
run;

data dataset2;
    set dataset2;
    id = _N_;
run;

Run proc compare the same way as above, but replace name with id for the ID variable, then join on t1.id = t2.id.

Upvotes: 0

PeterClemmensen
PeterClemmensen

Reputation: 4937

Here is an approach that is independent of your variable names, but is dependent on the internal order of variables being the same across the two data sets.

data data1;
input Name $ 1-4 Address $ 6-10 Age;
datalines;
John Carry 66 
Andy NYC   32 
Sam  LA    42 

data data2;
input Initial $ 1-4 Address $ 6-10 Age;
datalines;
John Carry 56 
Andy NV    32 
Sam  LA    42 
;

data temp2;
   set data2;
   length string $200;
   string = cats(of _ALL_);
run;

data data1_new(drop = string);

   if _N_ = 1 then do;
      dcl hash h(dataset : 'temp2');
      h.definekey('string');
      h.definedone();
   end;

   set data1;

   length string $200;
   string = cats(of _ALL_);

   flag = ifc(h.check(), 'Fail', 'Pass');

run;

Result:

Name  Address  Age  flag
John  Carry    66   Fail
Andy  NYC      32   Fail
Sam   LA       42   Pass

Upvotes: 0

Related Questions