Is the md5 function safe to use for merging datasets?

Question

We are about to promote a piece of code which uses the SAS md5() hash function to efficiently track changes in a large dataset.

format md5 $hex32.;
md5=md5(cats(of _all_));

As per the documentation:

The MD5 function converts a string, based on the MD5 algorithm, into a 128-bit hash value. This hash value is referred to as a message digest (digital signature), which is nearly unique for each string that is passed to the function.

At approximately what stage does 'nearly unique' begin to pose a data integrity risk (if at all)?

Stig Eide · Accepted Answer

I have seen an example where the md5 comparison goes wrong. If you have the values "AB" and "CD" in the (two columns of the) first row and "ABC" and "D" in the second row, they got the same md5 value. See this example:

data md5;
  attrib a b length=$3 informat=$3.;
  infile datalines;
  input a b;
  format md5 $hex32.;
  md5=md5(cats(of _all_));
datalines;
AB CD
A BCD
;run;

This is, of course, because the CATS(of _all_) will concatinate and strip the variables (converting numbers to string using the "best" format), without a delimiter. If you use CAT instead , this will not happen because the leading and trailing blanks are not removed. This error is not very far fetched. If you have missing values, then this could occur more often. If, for example, you have a lot of binary values in text variables, some of which are missing, it could occur very often.

One could do this manually, adding a delimiter in between the values. Of course, you would still have the case when you have ("AB!" and "CD") and ("AB" and "!CD") and you use "!" as delimiter...

Is the md5 function safe to use for merging datasets?

Answers (2)

Related Questions