Reputation: 787
I want to eliminate duplicates from a database, based on an identifier, an order and a condition.
More precisely, I have data with several observations. I have sometimes a condition that makes me want to keep that observation anyway (let fix it condition=1), but then also keep the observation with the same identifier even if this condition does not hold (condition=0).
But if I have for one identifier several observations where condition=0 then I want to elminate duplicates, with criterion being having the greatest order.
Without the condition I can do that
proc sort data=have;
by identifier descending order;
run;
proc sort nudopkey data=have;
by identifier;
run;
But how to incorporate my condition in this ?
Edit 1 : add a database example :
data Test;
input identifier $ order condition;
datalines;
1023 1 0
1023 2 0
1064 2 0
1064 1 0
1098 1 0
1098 1 1
;
Then I want to keep
Edit 2 : tried to precise my conditions
Upvotes: 2
Views: 2573
Reputation: 151
I presume you want to eliminate duplicates only when the condition for all records for an identifier is set to 0. In that case you want to keep the record with the maximum order and eliminate all other records with the same identifier.
Proc sql;
create table want as
select *
from test
group by identifier
having max (condition) ne 0
or order eq max (order)
;
Quit;
This will keep all rows for an Identifier where the maximum condition = 1, or in the case of those where maximum condition = 0, select the row with the maximum order.
Is that what you want?
Upvotes: 2
Reputation: 63424
Some of this depends on how you define 'condition'. Is your condition easily verifiable on every record for that identifier? Then you can do something like this.
If the condition is true for all records in that ID, all will have the same value (flag=0) and nodupkey on by identifier flag;
will remove extras. If the condition is false for all records, those will not be removed. If it's true for some and false for some, and you want to remove only some of the records with that identifier (only the duplicates where it is true), then you have to make sure that either it's sorted to have all of the condition=true records at top, or have a separate flag counter that determines what value the flag will be (since it sometimes will go to 0 in the middle, so 0 0 0 1 2 3 0 4 5 6
is what you want, not 0 0 0 1 2 3 0 0 1 2
).
Perhaps easier to see is to do it within a datastep. After sorting by identifier descending order
:
data want;
set have;
by identifier descending order;
if (condition=true) and not (first.identifier) then delete;
run;
This will, again, work if either condition=true is always at the top, or if it's always consistent within one ID group. If it's inconsistent and mixed, then you need to keep track of whether you've kept one where it was true (assuming you want to), or it might delete all records where it is true; use a separate variable to keep track of how many you've kept. first.identifier
will be 1/TRUE for the first record for that identifier only, not taking into account the condition. You could also create the flag, then sort by identifier flag descending order;
and guarantee the condition=true are at the top (either by making flag=0 for true, or sorting by descending flag
.)
Upvotes: 1