Reputation: 3082
For the dataset,
data testing;
input key $ output $;
datalines;
1 A
1 B
1 C
2 A
2 B
2 C
3 A
3 B
3 C
;
run;
Desired Output,
1 A
2 B
3 C
The logic: if either key or output has already appeared in its column in an earlier observation that was kept, delete the observation.
1 A (neither 1 nor A has appeared, so keep the obs)
1 B (1 has already appeared, so delete)
1 C (1 has already appeared, so delete)
2 A (A has already appeared, so delete)
2 B (neither 2 nor B has appeared, so keep the obs)
2 C (2 has already appeared, so delete)
3 A (A has already appeared, so delete)
3 B (B has already appeared, so delete)
3 C (neither 3 nor C has appeared, so keep the obs)
My effort:
Upvotes: 0
Views: 44
Reputation: 63434
The basic idea is to keep a dictionary of the values that have already been used, and to search it for each new row. Here's a simple array-based method. A hash table might be better (certainly less memory-intensive, and likely faster), but I'll leave that to your imagination.
data want;
  set testing;
  /* Temporary arrays that store the 'used' values. Both are character, */
  /* because key is read with $ in the testing dataset. */
  array _keys[30000] $ _temporary_;
  array _outputs[30000] $ _temporary_;
  retain _keysCounter 1 _outputsCounter 1; /* next free slot in each array */
  /* whichc searches a list (or array) for a value and returns 0 if it is not found */
  if whichc(key, of _keys[*]) = 0 and whichc(output, of _outputs[*]) = 0 then do;
    _keys[_keysCounter] = key;          /* store the key in the next spot in the dictionary */
    _keysCounter + 1;                   /* increment its counter */
    _outputs[_outputsCounter] = output; /* store the output in the next spot in the dictionary */
    _outputsCounter + 1;                /* increment its counter */
    output;                             /* write the actual data row */
  end;
  keep key output;
run;
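For comparison, here is a minimal sketch of the hash-table variant mentioned above. It is not tested against anything beyond the sample data in the question, and the names want_hash, seenKeys and seenOutputs are my own placeholders. It assumes the testing dataset from the question, with both key and output as character variables.

```sas
data want_hash;
  set testing;
  /* Create the two lookup tables once, on the first iteration. */
  /* No defineData call is made: the tables are used only for membership checks. */
  if _n_ = 1 then do;
    declare hash seenKeys();
    seenKeys.defineKey('key');
    seenKeys.defineDone();
    declare hash seenOutputs();
    seenOutputs.defineKey('output');
    seenOutputs.defineDone();
  end;
  /* check() returns 0 when the current value is already stored, */
  /* so a nonzero return code from both means neither value has been used. */
  if seenKeys.check() ne 0 and seenOutputs.check() ne 0 then do;
    seenKeys.add();     /* remember this key */
    seenOutputs.add();  /* remember this output */
    output;             /* write the actual data row */
  end;
  keep key output;
run;
```

On the sample data this should keep the same three rows as the array version (1 A, 2 B, 3 C), and it avoids both the fixed 30000-element limit and the linear search through the arrays.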
Upvotes: 1