Reputation: 199
I want to split a dataset into two subsets. I want the first subset to include all rows where the values of id or id1 appears more than once.
Like for this,
data test;
input id id1 var1 $ var2 $;
datalines;
1 11 A B
2 22 C D
3 11 E F
3 12 G H
7 15 K L
3 13 M N
2 10 I J
8 16 P L
;;;;
run;
I want the subsets to be
1 11 A B
2 22 C D
3 11 E F
3 12 G H
3 13 M N
2 10 I J
and
7 15 K L
8 16 P L
I can use the below code but it's not efficient as there are two passes through the dataset when splitting. Any other options?
proc sql;
create table DuplicateId as
select id from test
group by id
having count(id) > 1;
quit;
proc sql;
create table DuplicateID1 as
select id1 from test
group by id1
having count(id1) > 1;
quit;
proc sql;
create table split1 as
select * from test
where id not in (select id from DuplicateID) and id1 not in (select id1 from DuplicateID1);
quit;
proc sql;
create table split2 as
select * from test
where id in (select id from DuplicateID) or id1 in (select id1 from DuplicateID1);
quit;
Upvotes: 0
Views: 431
Reputation: 51566
Your code is making 4 passes thru the data. One each to find the unique (or duplicate) codes for each separate ID variable. And one each for each output dataset. You can reduce that last step to one pass by using indexed look ups.
proc sql ;
create table dup1 (index=(id)) as
select id from test
group by 1 having count(*) > 1
;
create table dup2 (index=(id1)) as
select id1 from test
group by 1 having count(*) > 1
;
quit;
data single multiple;
set test;
set dup1 key=id / unique;
if _iorc_ then set dup2 key=id1 / unique;
if _iorc_ then output single;
else output multiple;
_error_=0;
run;
If the lists of duplicate ids is small enough to store in memory you could use HASH objects to generate both of the duplicate lists in the a single pass of the data.
You could then write them back out and use the data step above to split. Or just make the step that makes creates the hash objects re-read the dataset again.
Upvotes: 1
Reputation: 21274
I thought a view may work, but it's definitely more passes and definitely more inefficient. A DoW loop is likely a better solution, at least only two passes, this one is likely 4 passes minimum! In line queries are definitely less efficient.
Interesting problem though. It may also help if you explain some context, perhaps there's an easier way to solve this issue somehow?
proc sql;
create view test2 as select *, (select count(_a1.ID) from test as _a1 where
_a1.id=t1.id group by id) as count1, (select count(_a2.ID1) from test as _a2
where _a2.id1=t1.id1 group by id1) as count2 from test as t1 order by 1, 2;
quit;
data single multiple;
set test2;
if max(count1, count2)>1 then
output multiple;
else
output single;
run;
Upvotes: 0