Reputation: 113
data emp;
input empID;
cards;
2
3
2
4
3
5
3
2
run;
I want to write proc sql delete query to remove all duplicate records but keep one so that dataset will have only following records. I want to do it using delete query, don't want to create table
empID
2
3
4
5
I have tried this but its not working.
proc sql;
delete from emp where empid in
( select t.empid from emp t where t.empid=empid
group by t.empId having count(t.empid)>2
);
quit;
Upvotes: 0
Views: 1407
Reputation: 7602
What's wrong with using proc sort
and the nodup
option?
proc sort data=emp nodup;
by empid;
run;
2nd answer :
If you are unable to create a table, either directly or using proc sort
for example, then I believe your only option is to use a data step with the modify
statement. This updates an existing dataset in place, rather than creating a new one and replacing the existing one when the code has successfully run.
As your data appears to be unsorted then a more creative approach is required, instead of a simple first.empid
that could be done with sorted data. What I have done is to build up a list of unique values as the dataset is being read, then look for the current value in that list. If it exists then the row is deleted. There's no need to drop the temporary field I have created (_list
) as modify
does not allow new fields to be added, it's only used behind the scenes.
Note that deleting records in this way (and using the delete from
syntax in proc sql
) doesn't physically remove the records, it just marks them as deleted so they don't appear when viewed or queried. If you open the Emp dataset after running the code, you'll see that there are missing row numbers.
data emp;
modify emp;
length _list $200; /* set length of temporary field */
retain _list; /* retain existing values */
if findw(_list,strip(empid))>0 then remove; /* delete observation if empid already exists */
else call catx(',',_list,empid); /* add current empid to list if it doesn't already exist */
run;
I should add that this answer is not very scalable, e.g. if you have a large dataset with many unique values then the _list variable will need to have a very long length to accommodate them. The better option would be if the Emp dataset was sorted, or at least indexed, by EmpId. That way you can do the following (emp is included twice in the set
statement as the by
statement is only valid when there are 2 datasets in the set
statement. This is a trick to enable first.
processing.)
data emp;
modify emp (obs=0) emp;
by empid;
if not first.empid then remove;
run;
Upvotes: 0
Reputation: 1270873
I don't think SAS proc sql
has any concept of a "rownum" column or "ctid" column. So, the easiest way is to create a new data set:
proc sql;
create table emp2 as
select distinct empid
from emp;
Upvotes: 4