mac_21
mac_21

Reputation: 113

SAS remove all but one duplicate record in sas

data emp;
input empID;
cards;
2 
3
2  
4
3 
5 
3 
2  
run;

I want to write proc sql delete query to remove all duplicate records but keep one so that dataset will have only following records. I want to do it using delete query, don't want to create table

empID
2
3
4
5

I have tried this but its not working.

proc sql;
delete from emp where empid in
(  select t.empid from emp t where t.empid=empid
   group by t.empId having count(t.empid)>2
 );
quit;

Upvotes: 0

Views: 1407

Answers (2)

Longfish
Longfish

Reputation: 7602

What's wrong with using proc sort and the nodup option?

proc sort data=emp nodup;
by empid;
run;

2nd answer :

If you are unable to create a table, either directly or using proc sort for example, then I believe your only option is to use a data step with the modify statement. This updates an existing dataset in place, rather than creating a new one and replacing the existing one when the code has successfully run.

As your data appears to be unsorted then a more creative approach is required, instead of a simple first.empid that could be done with sorted data. What I have done is to build up a list of unique values as the dataset is being read, then look for the current value in that list. If it exists then the row is deleted. There's no need to drop the temporary field I have created (_list) as modify does not allow new fields to be added, it's only used behind the scenes.

Note that deleting records in this way (and using the delete from syntax in proc sql) doesn't physically remove the records, it just marks them as deleted so they don't appear when viewed or queried. If you open the Emp dataset after running the code, you'll see that there are missing row numbers.

data emp;
modify emp;
length _list $200; /* set length of temporary field */
retain _list; /* retain existing values */
if findw(_list,strip(empid))>0 then remove; /* delete observation if empid already exists */
else call catx(',',_list,empid); /* add current empid to list if it doesn't already exist */
run;

I should add that this answer is not very scalable, e.g. if you have a large dataset with many unique values then the _list variable will need to have a very long length to accommodate them. The better option would be if the Emp dataset was sorted, or at least indexed, by EmpId. That way you can do the following (emp is included twice in the set statement as the by statement is only valid when there are 2 datasets in the set statement. This is a trick to enable first. processing.)

data emp;
modify emp (obs=0) emp;
by empid;
if not first.empid then remove;
run;

Upvotes: 0

Gordon Linoff
Gordon Linoff

Reputation: 1270873

I don't think SAS proc sql has any concept of a "rownum" column or "ctid" column. So, the easiest way is to create a new data set:

proc sql;
    create table emp2 as
        select distinct empid
        from emp;

Upvotes: 4

Related Questions