Reputation: 170
I'm trying to keep only the duplicate results for one column in a table. This is what I have.
proc sql;
create table DUPLICATES as
select Address, count(*) as count
from TEST_TABLE
group by Address
having COUNT gt 1
;
quit;
Is there any easier way to do this or an alternative I didn't think of? It seems goofy that I then have to re-join it with the original table to get my answer.
Upvotes: 2
Views: 30739
Reputation: 1031
Using proc sort with nodupkey and dupout will dedupe the data and give you an "out" dataset with duplicate records from the original dataset, but the "out" dataset does not include EVERY record with the ID variable - it gives you the 2nd, 3rd, 4th...Nth. So you aren't comparing all the duplicate occurrences of the ID variable when you use this method. It's great when you know what you want to remove and define enough by variables to limit this precisely, or if you know that your records with duplicate IDs are identical in every way and you just want them removed.
When there are duplicates in a raw file I receive, I like to compare all records where ID has more than one occurrence.
proc sort data=test nouniquekeys
uniqueout=singles
out=dups;
by=ID;
run;
Again, this method is great for working with messy raw data, and for debugging if your code might have produced duplicates.
Upvotes: 8
Reputation: 5470
That's easy using a data step:
proc sort data=TEST_TABLE nodupkey dupout=dups;
by Address;
run;
Refer to this documentation for further information
Upvotes: 7
Reputation: 123
proc sort data=TEST_TABLE;
by Address;
run;
data DUPLICATES;
set TEST_TABLE;
by Address;
if not (first.Address and last.Address) then output;
run;
Upvotes: 8
Reputation: 76
select field,count(field) from table
group by field having count(field) > 1
Upvotes: 1