user601828
user601828

Reputation: 509

How do I assign numeric values to the alphabet in SAS

I'm trying to convert a character string to a numeric variable and then sum the values of each character to use as a unique identifier for that field. So for example, I would like A=1, B=2, C=3.....X=24 Y=25 Z=26. Say my string is "CAB" so after running the code I would like the result to be an intermidiary column of numbers, where the value for CAB IS 3 1 2 and the result column would be derived by summing the string 3+1+2= 6 and show the value of the intermideate column, so the final value woud be 6.

Here is the sas code I used to convert the characters to numbers, but I need help with the result column.

DATA CHAR_VALUE;
SET WORK.XYZ;
CHAR_2_NUM=TRANSLATE(MY_VAR_CHAR, '1 2 3 ...24 25 26', 'A B C ...X Y Z');
NUM_CHAR=INPUT(CHAR_2_NUM,32.);
RUN;

Thanks in advance...I appreciate any help or suggestions. -rachel

Upvotes: 1

Views: 14299

Answers (3)

StatsStudent
StatsStudent

Reputation: 1604

Another simple solution is based on the collate function:

To convert a variable called MyNumbers (in the range of 1 to 26) to English upper-case characters, one can use:

collate(64 + MyNumbers, 64 + MyNumbers)

To obtain lower-case characters, one can use:

collate(96 + MyNumbers, 96 + MyNumbers)

Here's a quick example:

data _null_;
    do MyNumbers = 1 to 26;
          MyLettersUpper = collate(64 + MyNumbers, 64 + MyNumbers);
          MyLettersLower = collate(96 + MyNumbers, 96 + MyNumbers);
          put MyNumbers MyLettersUpper MyLettersLower;
    end;
run;

    1 A a
    2 B b
    3 C c
    4 D d
    5 E e
    6 F f
    7 G g
    8 H h
    9 I i
    10 J j
    11 K k
    12 L l
    13 M m
    14 N n
    15 O o
    16 P p
    17 Q q
    18 R r
    19 S s
    20 T t
    21 U u
    22 V v
    23 W w
    24 X x
    25 Y y
    26 Z z
    NOTE: DATA statement used (Total process time):
          real time           0.03 seconds
          cpu time            0.03 seconds

Upvotes: 1

BellevueBob
BellevueBob

Reputation: 9618

Your method to try and translate is a good attempt, but it will not really work. Here is a simple solution:

DATA CHAR_VALUE;
  retain all_chars 'ABCDEFGHIJKLMMOPQRSTUVXXYZ';
  set XYZ;

  length CHAR_2_NUM $200;
  CHAR_2_NUM = ' ';
  NUM_CHAR = 0;

  do i=1 to length(MY_VAR_CHAR);
     if i=1 then CHAR_2_NUM = substr(MY_VAR_CHAR,i,1);
            else CHAR_2_NUM = trim(CHAR_2_NUM) || ' ' || substr(MY_VAR_CHAR,i,1);
     NUM_CHAR + index(all_chars,substr(MY_VAR_CHAR,i,1));
     end;
  drop i all_chars;
RUN;

This takes advantage of the fact that the indexed position of each character of your source variable in the all_chars variable corresponds to the mapping you desired.

UPDATED to also create your CHAR_2_NUM variable, which I overlooked in the original question.

Upvotes: 0

Joe
Joe

Reputation: 63424

RANK will give the ASCII numeric value underlying a character; so A=65, B=66, Z=90, a=97, z=122.

So this should work (if you want only the uppercase values - not a different value for a than A):

data test;
charval='CAB';
do _t=1 to length(Charval);
    numval=sum(numval,rank(char(upcase(charval),_t))-64);
end;
put _all_;
run;

Another option (Based on the comments below), is to build an informat with the relationships between letter and value. My loop iterates over each character A to Z, you can then put whatever value you want for each letter as label (I just put 1,2,3,4... but label= will change that).

data fmts;
retain fmtname 'CHARNUM' type 'i';
do _t=65 to 90;
start=byte(_t); *the character, so byte(65)='A';
label=_t-64;    *the resulting number;
output;
end;
run;
proc format cntlin=fmts;
quit;

data test;
charval='CAB';
do _t=1 to length(Charval);
    numval=sum(numval,input(char(upcase(charval),_t),CHARNUM.));
end;
put _all_;
run;

Finally, if you want to be able to construct this in the same datastep, you could construct the relationships in a hash table and look up the result. I can explain that if desired, though I'd like to see a more detailed example of what you want to do in terms of defining the relationship between a letter and its code.

If you need to see the intermediate values, you can do that by inserting a CAT function in the loop- I recommend CATX:

data test;
charval='CAB';
format intermed $100.;
   do _t=1 to length(Charval);
        numval=sum(numval,input(char(upcase(charval),_t),CHARNUM.));
        intermed=catx('|',intermed,input(char(upcase(charval),_t),CHARNUM.)); *or the RANK portion from earlier;
    end;


put _all_;
run;

That would give you 3|1|2, which you could then do math on via SCAN:

do _t = 1 to countc(intermed,'|')+1;
  numval2 = sum(numval2,scan(intermed,_t,'|'));
end;

Upvotes: 2

Related Questions