Allan Bowe

Reputation: 12691

Is the md5 function safe to use for merging datasets?

We are about to promote a piece of code which uses the SAS md5() hash function to efficiently track changes in a large dataset.

format md5 $hex32.;
md5=md5(cats(of _all_));

As per the documentation:

The MD5 function converts a string, based on the MD5 algorithm, into a 128-bit hash value. This hash value is referred to as a message digest (digital signature), which is nearly unique for each string that is passed to the function.

At approximately what stage does 'nearly unique' begin to pose a data integrity risk (if at all)?

Upvotes: 3

Views: 1623

Answers (2)

Stig Eide

Reputation: 1062

I have seen an example where the md5 comparison goes wrong. If you have the values "AB" and "CD" in the two columns of the first row and "A" and "BCD" in the second row, both rows get the same md5 value. See this example:

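data md5;
  attrib a b length=$3 informat=$3.;
  /* MD5 returns a 16-byte value; $hex32. displays it as 32 hex digits */
  length md5 $16;
  format md5 $hex32.;
  infile datalines;
  input a b;
  /* CATS strips each value and joins them with no delimiter, */
  /* so both rows below are hashed as the string "ABCD"       */
  md5=md5(cats(of _all_));
datalines;
AB CD
A BCD
;
run;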

This is, of course, because CATS(of _all_) strips and concatenates the variables (converting numbers to strings using the "best" format) without a delimiter, so both rows become "ABCD" before hashing. If you use CAT instead, this will not happen, because leading and trailing blanks are not removed. This error is not very far-fetched. Missing values make it more likely: if, for example, you have a lot of binary values in text variables, some of which are missing, it could occur very often.

One could do this manually, adding a delimiter between the values. Of course, you would still have a collision when you have ("AB!" and "CD") and ("AB" and "!CD") and you use "!" as the delimiter...
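A minimal sketch of the delimiter idea, reusing the two rows from the example above. The dataset name md5_delim, the variable names md5_cats and md5_catx, and the choice of '|' as delimiter are all illustrative assumptions, not part of the original code. It only shows that a delimiter separates these two particular rows; it does not remove the delimiter-in-data problem just described.

```sas
data md5_delim;
  attrib a b length=$3 informat=$3.;
  length md5_cats md5_catx $16;       /* MD5 returns 16 bytes */
  format md5_cats md5_catx $hex32.;   /* shown as 32 hex digits */
  infile datalines;
  input a b;
  /* No delimiter: both rows are hashed as "ABCD" */
  md5_cats = md5(cats(a, b));
  /* With a delimiter: "AB|CD" versus "A|BCD" */
  md5_catx = md5(catx('|', a, b));
datalines;
AB CD
A BCD
;
run;

proc print data=md5_delim;
run;
```

In the printed output, md5_cats should be identical on both rows while md5_catx should differ. One caveat: CATX drops blank arguments together with their delimiter, so a missing value in one column can still collide with a missing value in another. Building the string with CAT and explicit delimiters, for example cat(a, '|', b), keeps the padding and avoids that case.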

Upvotes: 3

Joe

Reputation: 63424

MD5 has 2^128 distinct values, and from what I've read, at around 2^64 different inputs (that's roughly 1.8 × 10^19) you begin to have a high likelihood of finding a collision.
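As a rough sanity check on those figures, the usual birthday approximation can be applied. This assumes the MD5 outputs behave like uniformly random 128-bit values, which is reasonable for accidental collisions but not for deliberately crafted inputs. The row count of 10^9 is a hypothetical illustration, not a number from the question.

```latex
% Probability of at least one collision among n distinct inputs
P(\text{collision}) \approx 1 - \exp\!\left(-\frac{n^{2}}{2 \cdot 2^{128}}\right)

% At the birthday bound, n = 2^{64}:
P \approx 1 - e^{-1/2} \approx 0.39

% For a hypothetical table of n = 10^{9} rows:
P \approx \frac{n^{2}}{2 \cdot 2^{128}} = \frac{10^{18}}{2^{129}} \approx 1.5 \times 10^{-21}
```

On this estimate, accidental collisions between genuinely different strings are negligible at realistic dataset sizes.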

However, because of how MD5 is generated, there is some risk of collisions between very similar preimages that differ in as little as two bytes. As such, it's hard to say how risky this would be for your particular process. It's certainly possible for a collision to occur with as few as two messages; it's just not likely. Does saving [some] computing time benefit you enough to outweigh a small risk?

Upvotes: 2
