Reputation: 27047
I currently have a table with multiple duplicates across the primary key columns due to a lack of integrity checks. Be that as it may, I'm attempting to remove the duplicates. The problem is, there's no id
column, which means that finding the duplicates is nontrivial.
My current solution involves using the count(*)... having
construct to create a second table, and selecting the rows to be deleted into there. My problem is that the SAS delete
command doesn't allow for the following:
proc sql;
delete from TableA
where (v1,v2,v3) in TableB
Is there any way to delete from one table based on the contents of another?
Upvotes: 1
Views: 14681
Reputation: 1270401
You can do this with a correlated subquery:
proc sql;
delete from TableA
where exists (select 1
from TableB
where b.pk = TableA.v1 or b.pk = TableA.v2 or b.pk = TableA.v3
)
Upvotes: 1
Reputation: 9618
If I understand correctly, you want to remove every observation from your dataset where more than one observation has the same value of your "key" variables (removing ALL duplcates).
The best and easiest way to do that with SAS is to sort that dataset by your "key" variables and then use another data step to create your new copy. Harder to explain than to illustrate:
data have;
input x y z;
datalines4;
1 2 3
1 2 3
2 3 4
3 4 5
3 4 6
3 4 7
4 5 6
4 5 6
;;;;
run;
proc sort data=have;
by x y z;
run;
data want;
set have;
by x y z;
if first.z and last.z;
run;
The sub-setting IF
statement keeps only the four "unique" observations using the automatic FIRST.
and LAST.
variables created when you use a BY
statement.
If instead you wanted to keep one of the duplicates, you could use the NODUPKEY
option with PROC SORT
:
proc sort nodupkey data=have out=want2;
by x y z;
run;
This will eliminate two of the observations from the example, leaving you with unique values for your key. You cannot control which observations are retained using this technique.
Upvotes: 2
Reputation: 3417
Try this:
proc sql;
delete from TableA as a
where a.v1 = (select b.v1 from TableB as b where a.primaryKey = b.foreignKeyForTableA)
and so on for the other values. However, since you may recieve duplicates (ie more than one result) from TableB you might want to try out "select distinct" or "select Top 1" to only get one result.
Upvotes: 1