Reputation: 479
So let's say I have a data set with five observations and two columns.
A B
Orange Banana
Plum Apple
Banana Orange
Plum Grape
Grape Apple
I want to remove the duplicate rows in which that A || B equals B || A, i.e., remove the observation A = Banana and B = Orange as there is an observation of A = Orange and B = Banana previously in the data set.
Upvotes: 0
Views: 302
Reputation: 27498
Consider a more general case of n fields comprising a composite key in which the sorted order of the key values is the deduping factor.
A hash of the composite key fields, sorted and delimitedly concatenated can be used to check for prior existence.
In this example the key field values are copied into a parallel array so that sortc
can order them without disturbing original data. The first key occurrence is output
.
data have;
call streaminit(123);
do row = 1 to 1e5;
array numfields numfield1-numfield5;
do over numfields;
numfields = floor(rand('uniform', 5));
end;
length charfield1-charfield5 $8;
array charfields charfield1-charfield5;
do over charfields;
charfields = byte(65 + floor(rand('uniform', 5)));
end;
output;
end;
run;
data want;
set have;
array keys(10) $200 _temporary_ ;
array nums numfield1-numfield5;
array chars charfield1-charfield5;
_index = 0;
do _numindex = 1 to dim(nums);
_index + 1;
keys(_index) = put(nums(_numindex),RB8.);
end;
do _charindex = 1 to dim(chars);
_index + 1;
keys(_index) = chars(_charindex);
end;
call sortc (of keys(*));
_sorted_composite_key = catx('ff'x, of keys(*));
if _n_ = 1 then do;
declare hash sortedKeys ();
sortedKeys.defineKey('_sorted_composite_key');
sortedKeys.defineDone();
end;
if sortedKeys.check() ne 0 then do;
output;
sortedKeys.add();
end;
drop _:;
run;
Upvotes: 0
Reputation: 51566
Just order the values, for example make A the smaller of A and B.
Easy when there are just two variables.
proc sql ;
create table want as
select distinct
case when (a<b) then a else b end as A
,case when (a<b) then b else a end as B
from have
;
quit;
Upvotes: 0
Reputation: 6378
You could sort the values within a row, so that both observation A = Banana and B = Orange and observation of A = Orange and B = Banana become A = Banana and B = Orange.
Below uses CALL SORTC to sort them. It uses a view to create new copies of the sorted variables, assuming you don't want to lose the original variables. Once you have it sorted, you can de-dup any way you want.
data have ;
input a $8. b $8. ;
cards ;
Orange Banana
Plum Apple
Banana Orange
Plum Grape
Grape Apple
;
data myview/view=myview ;
set have ;
mya=a ;
myb=b ;
call sortc(mya,myb) ;
run ;
proc sort nodupkey data=myview out=want(drop=mya myb) ;
by mya myb ;
run ;
Upvotes: 1