Robert Penridge
Robert Penridge

Reputation: 8513

Large hex value to numeric equivalent in SAS

Overview:

What I have: 81A8221704654BAC9B24CDFE5A834541 (hex string)

What I want: 172343408754022659539630353407870715201 (decimal equivalent as string)


Detail:

I am exporting data from SAS to a 3rd party platform that performs better when the unique key is in numeric format. The key to the data I'm using is currently stored as a 32 char string containing a hex value, e.g. 81A8221704654BAC9B24CDFE5A834541.

The issue I'm facing is that the largest integer value available in SAS on Linux is 9007199254740992 in decimal (or 20000000000000 in hex). The 32 character hex value when converted to decimal will be much larger and any attempt to do conversion using numeric values will lose precision.

The only two approaches I can think of, are to either write some functions that do addition of two strings (ugly+slow), or perhaps someone has a clever way using some bitwise logical functions?

Any suggestions?

Upvotes: 2

Views: 827

Answers (2)

Robert Penridge
Robert Penridge

Reputation: 8513

Ok, here goes... I tested four approaches:

  1. Ugly+Slow. My own SAS code that implements the ability to convert a 32 char string containing a hex value into a very big integer.
  2. Process externally to SAS then join the data back. I chose Java for it's ubiquity as well as its existing support for BigIntegers, and the resulting simple code.
  3. Run some Python code from inside a SAS proc fcmp function to perform the conversion. The FCMP function can then be used within a datastep.
  4. (per Richard's answer) Use proc groovy to perform the conversion.

Results:

  • My SAS implementation took ~15 minutes to take the same hex string 1 million times and convert it to a big integer (stored as a string).

  • The java implementation took ~3 seconds to generate 1 million hex strings and convert them to big integers. This doesn't include any additional time that would be required to export the data, and join it back.

  • The proc fcmp + python solution took ~90 seconds to process 1 million UUIDs or hex string equivalents and convert them all within a SAS datastep.

  • Richard's proc groovy solution took ~10 seconds to process 1 million observations including writing them to disk and reading them back in. Joining these back to a wider dataset would incurr additional time/IO cost.

With the huge disparity in timing, the vanilla SAS option is clearly not the solution use unless you don't have the ability to implement any of the other solutions. The one advantage it does have is that it's all done natively within SAS - so if your admins have your system on lockdown, you can still achieve the conversion, albeit slower.

Between the other 3 solutions, I think the best solution depends on your use case:

  • The proc fcmp solution has the advantage that it makes for cleaner / self-contained SAS code that's easy to maintain/read/use and runs from within a datastep (something none of the other solutions can claim), but at the cost of running a little slower. I think this would be my preferred choice if I wasn't dealing with big-data performance issues.
  • The proc groovy solution is also nice in that all of the code can live within SAS and no external programs/JAR files are required. The downside is that it does incur the additional complexity of having to dump the hex values to disk, convert them, read them back in, and finally join back to the original dataset. Even with the additional steps, depending on the size/width of the original dataset, this is likely to be faster than the above 2 options.
  • The Java solution to me doesn't offer any benefits over the proc groovy solution . It may run marginally faster (untested), but even if it did, it requires compilation of the JAR file, and existing external .java source file, etc.. so leaves a bit more of a mess lying around.

Disclaimer:

My expertise isn't low level computer science coding - I'm sure there's ways to improve my vanilla SAS code approach but I don't know how so if you can suggest any improvements feel free to chime in.

#1 - SAS Implementation

The approach I took in SAS was as follows:

  1. Slice the 32 character have string into 4 8-character strings.
  2. Convert each of these 8 character strings from hex to decimal. These are small enough that SAS is able to store them as numeric values without losing precision and we can make use of the existing hex informat.
  3. Multiply each of these 4 numerics by the appropriate base-16 exponents to 'resize' them accordingly. The exponents were precalculated outside of SAS using a big number calculator as 16^16 and 16^24 are too big for SAS to handle. To multiply these numbers together I created an FCMP function called multiply_strings.
  4. Add the 4 resulting strings from step 3 together using an FCMP function I created called add_strings.

Below is some test code that performs the above steps and shows the workings, the steps are commented:

data _null_;  
  length front back final  $80;
  
  * CALCULATED USING A BIG NUMBER CALCULATOR. DONT PERFORM THESE CALCULATIONS IN SAS;
  exp1 = '1';                             * 16 ** 0;
  exp2 = '4294967296';                    * 16 ** 8;
  exp3 = '18446744073709551616';          * 16 ** 16;
  exp4 = '79228162514264337593543950336'; * 16 ** 24; 
  
  have = '81A8221704654BAC9B24CDFE5A834541';
  correct = '172343408754022659539630353407870715201'; * TESTED USING THE SEARCH TERM: 81A8221704654BAC9B24CDFE5A834541 hex as decimal ; 
  
  * STEP1;
  hex1 = upcase(substr(have,25,8));
  hex2 = upcase(substr(have,17,8));
  hex3 = upcase(substr(have, 9,8));
  hex4 = upcase(substr(have, 1,8));
  
  *STEP2;
  part1 = input(hex1,hex8.);
  part2 = input(hex2,hex8.);
  part3 = input(hex3,hex8.);
  part4 = input(hex4,hex8.);
  
  *STEP3;
  product1 = part1; 
  product2 = multiply_strings(put(part2,32.0), exp2, 0); 
  product3 = multiply_strings(put(part3,32.0), exp3, 0); 
  product4 = multiply_strings(put(part4,32.0), exp4, 0); 
 
  * STEP4;
  * SHOULD CHANGE ADD STRINGS TO ACCEPT 4 PARAMS BUT WILL SEE HOW PERFORMANCE COMPARES FIRST;
  front  = add_strings(product1, product2); 
  back   = add_strings(product3, product4); 
  final  = add_strings(front , back  ); 
  
 
  put @3 final=;
  put correct=;
  put '-';
  put have=;
  put hex1=;
  put hex2=;
  put hex3=;
  put hex4=;
  put '-';
  put part1=;
  put part2=;
  put part3=;
  put part4=;
  put '-';
  put product1=;
  put product2=;
  put product3=;
  put product4=;
  put '-';
  put exp1=;
  put exp2=;
  put exp3=;
  put exp4=;
%runquit;

And the results from the above step:

   final=172343408754022659539630353407870715201
 correct=172343408754022659539630353407870715201
 -
 have=81A8221704654BAC9B24CDFE5A834541
 hex1=5A834541
 hex2=9B24CDFE
 hex3=04654BAC
 hex4=81A82217
 -
 part1=1518552385
 part2=2602880510
 part3=73747372
 part4=2175279639
 -
 product1=1518552385
 product2=11179286665845800960
 product3=1360398897392653722978353152
 product4=172343408752662260631058413017528008704
 -
 exp1=1
 exp2=4294967296
 exp3=18446744073709551616
 exp4=79228162514264337593543950336

The code I used to run it 1 million times was this (I smashed it all together on the off-chance it would run faster, it didn't):

data _null_;  
  length front back final  $80;
  
  * CALCULATED USING A BIG NUMBER CALCULATOR. DONT PERFORM THESE CALCULATIONS IN SAS;
  part1_exp = '1';                             * 16 ** 0;
  part2_exp = '4294967296';                    * 16 ** 8;
  part3_exp = '18446744073709551616';          * 16 ** 16;
  part4_exp = '79228162514264337593543950336'; * 16 ** 24; 
  
  have = '81A8221704654BAC9B24CDFE5A834541';
  correct = '172343408754022659539630353407870715201'; * TESTED USING THE SEARCH TERM: 81A8221704654BAC9B24CDFE5A834541 hex as decimal ; 

  do blah=1 to 1000000; 
    front  = add_strings(input(upcase(substr(have,25,8)),hex8.), multiply_strings(put(input(upcase(substr(have,17,8)),hex8.),32.0), part2_exp, 0)); 
    back   = add_strings(multiply_strings(put(input(upcase(substr(have, 9,8)),hex8.),32.0), part3_exp, 0), multiply_strings(put(input(upcase(substr(have, 1,8)),hex8.),32.0), part4_exp, 0)); 
    final  = add_strings(front , back  ); 
  end;
run;

FCMP function: add_strings()

This function simply performs addition as you would write it out by hand. Obviously we're trying to avoid converting any significant part of it into a SAS numeric. Perhaps chunking it and using numeric addition would be faster but I'd already spent enough time writing it this way and time constraints etc.....

proc fcmp outlib=work.funcs.funcs;

  function add_strings(str1 $, str2 $ ) $;

    length result rev1 rev2 $40 digit $1;
       
    len  = max(lengthn(cats(str1)), lengthn(cats(str2)));
    rev1 = cats(reverse(str1));
    rev2 = cats(reverse(str2));
    max_result_length = len + 1;

    carry = 0;
    result = '';
    do cnt=1 to max_result_length;      
      s      = sum(input(substr(rev1,cnt,1),best.), input(substr(rev2,cnt,1),best.), carry);      
      digit  = mod(s,10);
      carry  = floor(s/10);      
      result = cats(digit, result);
    end;

    /* REMOVE LEADING ZEROS */
    result_len = lengthn(cats(result));
    do first_non_zero=1 to result_len;
      if substr(result, first_non_zero, 1) ne '0' then do;
        leave;
      end;
    end;    
    result = cats(substr(result,min(first_non_zero,result_len),result_len));

    return(result);
  endsub;
run;

FCMP function: multiply_strings()

This function multiplies two large strings containing very big integers together... It should be accurate to any number where the result is no more than 80 characters but I haven't tested the boundary conditions. The third option is simply to print debugging information to the log, set to 1 to print it.

This was a little trickier than the add_strings() function as typically multiplication by hand (at least how I do it) requires you to add numbers together as part of the process. We want to avoid doing this as these numbers will be very large and we would have had to use the add_strings() function to do it (which I already knew was slow from testing).

Instead I found this approach to multiplication which would avoid having to add any numbers together that would be too large for SAS to store: https://www.youtube.com/watch?v=b_xUE4wkVKY . I don't know if this approach has a proper name or not, I couldn't find a lot of information on it, my google-fu was lacking.

proc fcmp outlib=work.funcs.funcs;

  function multiply_strings(str_a $, str_b $, debug ) $;

    length result $80 rev_a rev_b $40 digit $1;
       
    array a    [40]    a1-a40;       * STR_A AS AN ARRAY;
    array b    [40]    b1-b40;       * STR_B AS AN ARRAY;
    array base [80]    base1-base80; * PRODUCT OF STR_A AND STR_B ELEMENTS.;
    array r    [80]    r1-r80;       * RESULT;
    
    **
    ** CANT DYNAMICALLY SIZE ARRAY P UNFORTUNATELY (AFAIK). 
    ** LOTS OF WASTED SPACE AND PROBABLY SLOWER BUT EASIER TO CODE/DEBUG I SUPPOSE. 
    ** 39 LEVELS. 2 ROWS PER LEVEL. 80 CELLS PER ROW.
    *;
    array p [39,2,80] p1-p6240; 
  
    a_len   = lengthn(str_a);
    b_len   = lengthn(str_b);
    rev_a   = reverse(str_a);
    rev_b   = reverse(str_b);
    max_len = max(a_len, b_len);
    
    * INITIALIZE A+B ARRAYS;
    do cnt=41-(max_len) to 40;
      a[cnt] = input(substr(rev_a,41-cnt,1),best.);
      b[cnt] = input(substr(rev_b,41-cnt,1),best.);
      
      if debug then do;
        put cnt= max_len= a[cnt] b[cnt];
      end;
    end;

    * CALCULATE BASE;
    do cnt=41-(max_len) to 40;
      product     = a[cnt] * b[cnt];
      left        = cnt*2-1;
      right       = cnt*2-0;
      base[left]  = floor(product/10);
      base[right] = mod(product,10);
      
      if debug then do;
        put cnt= a[cnt] b[cnt] product= base[left] base[right] left= right=;
      end;
      
    end;

    * CALCULATE PYRAMID.  BOTTOM (LVL=40) OF PYRAMID IS POINTY. REMEMBER EACH LEVEL IS 2 BLOCKS HIGH (MIDDLE DIMENSION OF ARRAY).;
    do lvl=1 to max_len-1;    
      start_cell = 81-(max_len*2)+lvl;
      end_cell   = 80-lvl;      
      cnt_start  = 41-(max_len);
      cnt_end    = 40;      
      cnt        = cnt_start;
      iter       = 0;
      do while (cnt + lvl le cnt_end); * CALCULATE THE PRODUCTS;
        product1       = a[cnt]*b[cnt+lvl];
        product2       = b[cnt]*a[cnt+lvl];
        cell1          = start_cell+(iter*2);
        cell2          = cell1+1;
        p[lvl,1,cell1] = floor(product1/10);
        p[lvl,1,cell2] = mod(product1,10);
        p[lvl,2,cell1] = floor(product2/10);;
        p[lvl,2,cell2] = mod(product2,10);;
        
        if debug then do;
          put lvl= cnt_start= cnt_end= cnt= product1= product2= start_cell= end_cell= cell1= cell2= iter=;
        end;

        cnt  = cnt + 1;
        iter = iter + 1;
      end;      
    end;

    * CALCULATE RESULT. SIMPLY PERFORM ADDITION USING THE BASE AND THE PYRAMID ELEMENTS (BY COLUMN POSITION). CARRY FOR ANY NUMBERS >= 10.;
    carry = 0;
    do cell=80 to 81-(max_len*2) by -1;
    
      r[cell] = sum(base[cell], carry);
                
      do lvl=1 to max_len-1;  
        do row=1 to 2;
          r[cell] = sum(0,r[cell],p[lvl,row,cell]);
        end;
      end;
      
      carry   = floor(r[cell]/10);
      r[cell] = mod(r[cell],10);
    end;

    * OUTPUT THE FINAL CALCULATIONS TO THE LOG;
    if debug eq 1 then do;      
      option linesize=200;
      put a[*];
      put b[*];
      put ' ';
      put base[*];
      put ' ';
      do lvl=1 to max_len-1;
        do row=1 to 2;
          do cell=1 to 80;
            put p[lvl,row,cell] z1. @;
          end;
          put ;
        end;
      end;      
      put ' ';
      put r[*];
    end;
    
    * TRIM MISSING VALUES AND ZEROES FROM THE FRONT OF THE RESULT.;
    result = cats(of r[*]);    
    do first_non_zero=1 to 80;
      if substr(result, first_non_zero, 1) not in (0,.) then do;
        leave;
      end;
    end;    
    result = cats(substr(result,min(first_non_zero,80),80));
    
    return(result);
    
  endsub;
run;

#2 - The JAVA solution

Much more concise =P. Create a file called test.java containing the following:

import java.math.BigInteger;
import java.util.UUID;

public class test {
    public static void main(String[] args)
    {
        int reps = 1000000;
        for (int i = 0; i < reps; i++) {
            String s = UUID.randomUUID().toString().replaceAll("\\-", "");
            BigInteger big = new BigInteger(s, 16);
            /*System.out.println(s);*/
            /*System.out.println(big);*/
        }

    }
}

Compile it from the command line using javac test.java then execute it by java test.

#3 - proc fcmp+Python

This is the nicest approach IMO. The code is fast-enough for most use-cases, it's simple to call and can be run from inside a datastep, and works on both UUID formatted strings and 32 character hex strings.

If you're running a SAS version below SAS9.4M7 then you WILL need to define the MASM2PATH and MAS_PYPATH environment variables in the appropriate sasv9.cfg / sasv9_usermods.cfg file(s).

proc fcmp outlib=work.funcs.funcs;
  /* SYSTEM REQUIREMENTS: SAS94M6. PYTHON2.7 OR ABOVE. MAS_M2PATH AND MAS_PYPATH MUST BE SET AS SYSTEM ENVIRONMENT VARIABLES */
 
  function uuid_to_decimal(str $) $40;
   
    * CREATE THE INNER PYTHON FUNCTION CALL;
    declare object py(python);
    submit into py;
      def u2d( s ):
         "Output: u2d_output"
         import uuid        
         return str(uuid.UUID(s).int)
    endsubmit;  
    
    rc = py.publish(); * PUBLISH THE CODE TO THE PYTHON INTERPRETER;
    rc = py.call("u2d", str); * CALL THE PYTHON FUNCTION FROM SAS;
      
    return(py.results["u2d_output"]);
 
  endsub;
run;
 
option cmplib=work.funcs;
data _null_;
  length test1 test2 $40;
  do cnt=1 to 500000;
    test1 = uuid_to_decimal("a341b0d0-1ec5-11eb-b64d-02b1159b38e9");
    test2 = uuid_to_decimal("a341b0d01ec511ebb64d02b1159b38e9");
  end;
  put test1= test2=;
run;

#4 - proc groovy (thanks to @Richard for this solution)

A faster solution than the FCMP/Python solution but the big downside to me is that you need to dump the values to convert then import and join them back to the original dataset afterwards. The hidden performance cost will then depend on the size and width of the dataset you are working with.

%let work = %sysfunc(pathname(work));

data have;
input hexstr $32.;
do i = 1 to 1000000;
  output;
end;
drop i;
datalines;
81A8221704654BAC9B24CDFE5A834540
;
run;

proc export data=have replace outfile="&work/have.txt" dbms=tab;
  putnames=no;
run;

proc groovy;
  submit "&work"; * PASS IN OUR WORK PATH AS THE FIRST PARAMETER;
    File out = new File(args[0]+"/want.txt");
    out.write "hexstr,decstr\n";

    new File (args[0]+"/have.txt").eachLine { line -> 
      out.append(line+",\""+new BigInteger(line, 16).toString()+"\"\n"); 
    }
    
  endsubmit;
quit;

options source;

proc import datafile="&work/want.txt" replace out=want dbms=csv;
run;

Upvotes: 0

Richard
Richard

Reputation: 27498

You can use Java BigInteger.

Groovy

Write data to file and read conversion back

data have;
input hexstr $32.;
datalines;
81A8221704654BAC9B24CDFE5A834540
81A8221704654BAC9B24CDFE5A834541
81A8221704654BAC9B24CDFE5A834542
81A8221704654BAC9B24CDFE5A834543
;

proc export data=have replace outfile='c:\temp\have-hexstr.dat' dbms=tab;
  putnames=no;
run;

proc groovy;
  submit;
    File out = new File("c:\\temp\\want-decstr.dat");
    out.write "hexstr,decstr\n";

    new File ("c:\\temp\\have-hexstr.dat").eachLine { line ->
      out.append(line+",\""+new BigInteger(line, 16).toString()+"\"\n");
    }
  endsubmit;
quit;

options source;

proc import datafile='c:\temp\want-decstr.dat' replace out=want dbms=csv;
run;

JavaObj

You could create a jar that contains a Converter class for performing the conversion and call the method from a JavaObj. The jar would have to be added to the SAS session classpath at startup.

public class Converter
{
  public string hexStrToDecStr(hexStr) 
  { 
    return new java.math.BigInteger(hexStr,16).toString();
  }
}

data want;
  set have;
  if _n_ = 1 then do;
    declare javaobj j('Converter');
  end;

  length decStr $50;
  j.callStringMethod('hexStrToDecStr', hexStr, decStr);
run;

Ideal JavaObj

Can't do this because JavaObj can only pass double values to methods :(

data want;
  set have;

  length decStr $50;

  * wont work,  cause 16 passed as double! :(;
  declare javaobj j ('java/math/BigInteger', hexStr, 16);
  j.callStringMethod('toString', decStr);

  j.delete();
run;

Upvotes: 1

Related Questions