Reputation: 45
I need to assign the same ID to every combination of same values in three variables and i really have no clue of what can I do to spot combination of the same three letters among the three variables (A-B-C should match with B-C-A) Here's my input data:
data HAVE;
input ID VAR1 VAR2 VAR3 $;
DATALINES;
001 A B C
002 A C B
003 B C A
004 A B
005 B A
006 D E F
007 E F D
008 F E D
009 E F
010 F E
;
RUN;
And the resulting ID_NEW should be:
data HAVE;
input ID VAR1 VAR2 VAR3 $ ID_NEW;
DATALINES;
001 A B C 1
002 A C B 1
003 B C A 1
004 A B 2
005 B A 2
006 D E F 3
007 E F D 3
008 F E D 3
009 E F 4
010 F E 4
;
RUN;
I am able to spot combination of two by proc sql and performing a left join with keys t1.var1=t2.var2 and t1.var2=t2.var1 but it comes to spot the three letters combinations, I want to avoid the join because I could have 6 possible combinations and I feel that there's a smart way to so, without repeating the join 6 times! Perhaps with a combination of catt and scan functions?
Thank you in advance for your help :) !
Upvotes: 1
Views: 437
Reputation: 27508
ID_NEW is based on combination. Each unique combination can be thought of as a set element. The set element is arrived at by sorting the data values (which converts a values permutation to a distinct permutation representing the values combination) and concatenating them with delimiters into a single value that is a set element. The delimiting ensures a separation of contributing value in the element. A set element can also be thought of as a hash key, and there in comes the single-pass solution.
Example
perm sort element/key
A B C -> A B C -> A,B,C
B C A -> A B C -> A,B,C
CALL SORTC
will sort variables, but you do not want to sort the actual variables, so the values have to be copied into other variables that will be sorted (and dropped from output)
A hash
can associate a key value with a combination_id. The combo_id is incremented each time a new key value is encountered, or retrieved when the key already exists.
Example:
data HAVE;
input ID VAR1 $ VAR2 $ VAR3 $;
DATALINES;
001 A B C
002 A C B
003 B C A
004 A B .
005 B A .
006 D E F
007 E F D
008 F E D
009 E F .
010 F E .
011 C A B
012 E D F
RUN;
data want;
if 0 then set have; * prep pdv;
length _key $100 combo_id 8; * host variables for hash;
if _n_ = 1 then do;
declare hash ids();
ids.defineKey ('_key');
ids.defineData ('combo_id');
ids.defineDone();
end;
set have;
* copy data into variables that will be sorted;
* combination of raw data;
* use cats in case some of variables are numeric;
_item1 = cats(var1);
_item2 = cats(var2);
_item3 = cats(var3);
* alternative if combo_id is to be based on formatted data values;
* use vvalue to obtained formatted representation of data values;
* _item1 = vvalue(var1);
* _item2 = vvalue(var2);
* _item3 = vvalue(var3);
array items _item1-_item3; * arrayify the _items so they can be sorted;
call sortc(of items(*));
* delimit with hex 255, a character not likely to occur
* in the data value or formatted representation;
_key = catx('FF'x, of items(*));
* retrieve or increment the combo_id;
if ids.find() ne 0 then do;
combo_id = ids.num_items + 1;
ids.add();
end;
drop _:;
run;
Upvotes: 1
Reputation: 1319
You will be able to do this using the SORTC function, which sorts an array of character values into alphabetical order:
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a003106052.htm
One way would be to create a new variable that contains the values of VAR1-VAR3 in alphabetical order:
data want;
length sorted_vars $ 20;
set have;
array vars[*] var1-var3;
call sortc(of vars[*]);
sorted_vars = cats(of vars[*]);
run;
The above code isn't tested but should be pretty close. From there, you can sort on sorted_vars
and increment your id_new
variable on each first.sorted_vars
.
Upvotes: 0