useR

Reputation: 3082

Data step manipulation based on two fields conditioning

For the dataset,

data testing;
    input key $ output $;
    datalines;
1 A
1 B
1 C
2 A
2 B
2 C
3 A
3 B
3 C
;
run;

Desired Output,

1 A 
2 B
3 C

The logic is: if either the key or the output has already appeared in its column among the observations kept so far, delete the observation.

1 A (neither 1 nor A has appeared yet, so keep the obs)
1 B (1 has already appeared, so delete)
1 C (1 has already appeared, so delete)
2 A (A has already appeared, so delete)
2 B (neither 2 nor B has appeared yet, so keep the obs)
2 C (2 has already appeared, so delete)
3 A (A has already appeared, so delete)
3 B (B has already appeared, so delete)
3 C (neither 3 nor C has appeared yet, so keep the obs)

My effort:

Upvotes: 0

Views: 44

Answers (1)

Joe

Reputation: 63434

The basic idea here is to keep a dictionary of the values that have already been used, and to search it for each new row. Here's a simple array-based method. A hash table might be better: it would certainly be less memory-intensive and would likely be faster, but I'll leave that to your imagination.

data want;
  set testing;
  /* Temporary arrays to store 'used' values. The question reads key with $, */
  /* so both arrays are character and both are searched with whichc.         */
  array _keys[30000] $ _temporary_;
  array _outputs[30000] $ _temporary_;
  retain _keysCounter 1 _outputsCounter 1;  *counters to help us store the values;
  /* whichc searches a list (or array) for a character value and returns 0 if it is not found */
  if whichc(key, of _keys[*]) = 0 and whichc(output, of _outputs[*]) = 0
    then do;
      _keys[_keysCounter] = key;            *store the key in the next spot in the dictionary;
      _keysCounter+1;     *increment its counter;
      _outputs[_outputsCounter] = output;   *store the output in the next spot in the dictionary;
      _outputsCounter+1;  *increment its counter;
      output;             *output the actual datarow;
  end;
  keep key output;
run;
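
For reference, the hash-table alternative mentioned above could look roughly like the following. This is a minimal, untested sketch rather than a definitive implementation: the names want_hash, seenKeys and seenOutputs are made up for illustration, and it assumes key and output are the variables read from testing.

```sas
data want_hash;
  set testing;
  if _n_ = 1 then do;
    /* seenKeys and seenOutputs are illustrative names. Each hash object */
    /* holds the values of one column that have already been kept.       */
    declare hash seenKeys();
    seenKeys.defineKey('key');
    seenKeys.defineDone();
    declare hash seenOutputs();
    seenOutputs.defineKey('output');
    seenOutputs.defineDone();
  end;
  /* check() returns 0 when the current value is already in the hash, */
  /* so a nonzero return from both means neither value has been used. */
  if seenKeys.check() ne 0 and seenOutputs.check() ne 0 then do;
    seenKeys.add();      /* remember this key */
    seenOutputs.add();   /* remember this output */
    output;              /* write the row to want_hash */
  end;
run;
```

On the sample data this should keep the same three rows as the array version (1 A, 2 B, 3 C). The hash objects grow as values are added, so there is no fixed 30000-element limit to size in advance.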

Upvotes: 1
