Reputation: 479

Remove Duplicates where A || B equals B || A

So let's say I have a data set with five observations and two columns.

A       B
Orange  Banana
Plum    Apple
Banana  Orange
Plum    Grape
Grape   Apple

I want to remove the duplicate rows in which that A || B equals B || A, i.e., remove the observation A = Banana and B = Orange as there is an observation of A = Orange and B = Banana previously in the data set.

Upvotes: 0

Answers (3)

Richard

Reputation: 27498

Consider a more general case of n fields comprising a composite key in which the sorted order of the key values is the deduping factor.

A hash of the composite key fields, sorted and delimitedly concatenated can be used to check for prior existence.

In this example the key field values are copied into a parallel array so that sortc can order them without disturbing original data. The first key occurrence is output.

data have;
  call streaminit(123);

  do row = 1 to 1e5;
    array numfields numfield1-numfield5;
    do over numfields;
      numfields = floor(rand('uniform', 5));
    end;
    length charfield1-charfield5 $8;
    array charfields charfield1-charfield5;
    do over charfields;
      charfields = byte(65 + floor(rand('uniform', 5)));
    end;
    output;
  end;
run;

data want;
  set have;
  array keys(10) $200 _temporary_ ;

  array nums numfield1-numfield5;
  array chars charfield1-charfield5;

  _index = 0;
  do _numindex = 1 to dim(nums);
    _index + 1;
    keys(_index) = put(nums(_numindex),RB8.);
  end;

  do _charindex = 1 to dim(chars);
    _index + 1;
    keys(_index) = chars(_charindex);
  end;

  call sortc (of keys(*));

  _sorted_composite_key = catx('ff'x, of keys(*));

  if _n_ = 1 then do;
    declare hash sortedKeys ();
    sortedKeys.defineKey('_sorted_composite_key');
    sortedKeys.defineDone();
  end;

  if sortedKeys.check() ne 0 then do;
    output;
    sortedKeys.add();
  end;

  drop _:;
run;

Upvotes: 0

Tom

Reputation: 51566

Just order the values, for example make A the smaller of A and B.

Easy when there are just two variables.

proc sql ;
create table want as
  select distinct
    case when (a<b) then a else b end as A
   ,case when (a<b) then b else a end as B
  from have
;
quit;

Upvotes: 0

Quentin

Reputation: 6378

You could sort the values within a row, so that both observation A = Banana and B = Orange and observation of A = Orange and B = Banana become A = Banana and B = Orange.

Below uses CALL SORTC to sort them. It uses a view to create new copies of the sorted variables, assuming you don't want to lose the original variables. Once you have it sorted, you can de-dup any way you want.

data have ;
  input a $8. b $8. ;
  cards ;
Orange  Banana
Plum    Apple
Banana  Orange
Plum    Grape
Grape   Apple
;


data myview/view=myview ;
  set have ;
  mya=a ;
  myb=b ;
  call sortc(mya,myb) ;
run ;


proc sort nodupkey data=myview out=want(drop=mya myb) ;
  by mya myb ;
run ;

Upvotes: 1

Remove Duplicates where A || B equals B || A

Answers (3)

Related Questions