Generate unique ID from multiple values with fault tolerance

Question

Given some values, I'd like to make a (pretty darn) unique result.

$unique1 = generate(array('ab034', '981kja7261', '381jkfa0', 'vzcvqdx2993883i3ifja8', '0plnmjfys'));
//now $unique1 == "sqef3452y";

I also need inputs that are pretty close to return the same result. In this case, 20% of the values are missing.

$unique2 = generate(array('ab034', '981kja7261', '381jkfa0', 'vzcvqdx2993883i3ifja8'));
//also $unique2 == "sqef3452y";

I'm not sure where to begin with such an algorithm but I have some assumptions.

I assume that the more values given, the more accurate the resulting ID – in other words, using 20 values is better than 5.
I also assume that a confidence factor can be calculated and adjusted.

What would be nice to have is a weight factor where one can say 'value 1 is more important than value 3'. This would require a multidimensional array for input instead of one dimension.

I just mashed on the keyboard for these values, but in practice they may be short or long alphanumeric values.

p.marino · Accepted Answer

Your two requirements seem a bit contradictory. If the last 20% of the array is not significant (i.e. you want the same result whether it equals '0plnmjfys' or is null), then why include it in the first place?

The first step is to clarify what you want to disambiguate on. If a value is not significant, just drop it.

Once you have decided this, you have to ask yourself whether you expect two "close" inputs to produce "close" IDs, e.g. maybe you want:

$unique1 = generate(array('ab034', '981kja7261', '381jkfa0', 'vzcvqdx2993883i3ifja8', '0plnmjfys'));
//now $unique1 == "sqef3452y";

$unique1 = generate(array('ab034', '981kja7261', '381jkfa0', 'vzcvqdx2993883i3ifja8', '0plSsa45'));
//now $unique1 == "sqef3452k";

The latter is trickier, because most unique id generators use hashes (you may want to look these up, too) so two very similar strings can return wildly different results.
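To see the effect, here is a minimal sketch using PHP's built-in md5(). The two inputs are the ones from the example above and differ only in the last element. The helper countDifferingChars() and the '|' separator are made up for this illustration.

```php
<?php
// Two inputs that differ only in their last element.
$a = implode('|', ['ab034', '981kja7261', '381jkfa0', 'vzcvqdx2993883i3ifja8', '0plnmjfys']);
$b = implode('|', ['ab034', '981kja7261', '381jkfa0', 'vzcvqdx2993883i3ifja8', '0plSsa45']);

$hashA = md5($a);
$hashB = md5($b);

// Hypothetical helper: count the positions where two equal-length strings differ.
function countDifferingChars(string $x, string $y): int
{
    $diff = 0;
    for ($i = 0, $n = strlen($x); $i < $n; $i++) {
        if ($x[$i] !== $y[$i]) {
            $diff++;
        }
    }
    return $diff;
}

echo $hashA, "\n", $hashB, "\n";
echo countDifferingChars($hashA, $hashB), " of 32 hex characters differ\n";
```

With a hash like this, a small change in one value typically changes most of the output, so the two IDs give no hint that the inputs were nearly identical.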

If you want to ensure uniqueness and don't care about "closeness" in your results, just calculate the hash of the concatenated string, or a hash for each input string and concatenate the hash codes.
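Both variants could be sketched roughly as follows. The function names hashConcatenated() and hashEach(), the choice of sha256 and crc32b, and the separators are assumptions for illustration, not something the question prescribes.

```php
<?php
// Variant 1: one hash over all values joined with a separator.
// The separator (an assumption) keeps ['ab', 'c'] and ['a', 'bc'] apart.
function hashConcatenated(array $values): string
{
    return hash('sha256', implode("\x1f", $values));
}

// Variant 2: one short hash per value, concatenated.
// Each 8-character chunk depends only on its own input value.
function hashEach(array $values): string
{
    $parts = [];
    foreach ($values as $value) {
        $parts[] = hash('crc32b', $value);
    }
    return implode('-', $parts);
}

$five = ['ab034', '981kja7261', '381jkfa0', 'vzcvqdx2993883i3ifja8', '0plnmjfys'];
$four = array_slice($five, 0, 4);

echo hashConcatenated($five), "\n", hashConcatenated($four), "\n";
echo hashEach($five), "\n", hashEach($four), "\n";
```

One design difference is worth noting: with the per-value variant, dropping the last value leaves the other chunks unchanged, so the shorter ID is a prefix of the longer one. With the single concatenated hash, the whole ID changes.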

If you want to favor "closeness", you could calculate hashes for the most relevant parts and apply a Soundex algorithm or something similar to the less relevant parts.

Just remember that you have conflicting requirements here: unique IDs try very hard to give (wildly) different codes for different strings, even if the only difference is one character in a 1000-character string.

Closeness (this string is "more or less the same" as this second string) tries to do the exact opposite, and will hopefully return the same code for the two. Quoting Wikipedia on the Soundex algorithm:

Using this algorithm, both "Robert" and "Rupert" return the same string "R163" while "Rubin" yields "R150". "Ashcraft" and "Ashcroft" both yield "A261".
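PHP ships a built-in soundex() function, so the quoted examples can be checked directly. This is only a quick sketch; note that soundex() skips non-letter characters, so on alphanumeric keys like the ones in the question it only looks at the letters.

```php
<?php
// soundex() is a PHP built-in that returns a 4-character phonetic key.
$names = ['Robert', 'Rupert', 'Rubin', 'Ashcraft', 'Ashcroft'];

foreach ($names as $name) {
    echo $name, ' => ', soundex($name), "\n";
}
// Robert and Rupert share the key R163, Rubin gets R150,
// Ashcraft and Ashcroft share the key A261.
```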

So... which is which? Do you think that using hashes for the first 4 elements (in your example) and Soundex for the least significant 20% would work?

This would probably result (getting back to your example) in something like:

$unique2 = generate(array('ab034', '981kja7261', '381jkfa0', 'vzcvqdx2993883i3ifja8',));
//now $unique2 == "AB67R45-000000";

$unique1 = generate(array('ab034', '981kja7261', '381jkfa0', 'vzcvqdx2993883i3ifja8', '0plSsa45'));
//now $unique1 == "AB67R45-012000";
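A rough sketch of such a generate() function is shown below. Everything specific in it is an assumption made for illustration: the $significantCount parameter, the 7-character md5 prefix, the "0000" placeholder for a missing tail, and the PREFIX-TAIL layout. The IDs it produces will not match the made-up values above.

```php
<?php
// Hypothetical sketch: hash the significant values, Soundex the optional tail.
// $significantCount and the PREFIX-TAIL layout are assumptions, not a fixed format.
function generate(array $values, int $significantCount = 4): string
{
    $significant = array_slice($values, 0, $significantCount);
    $optional    = array_slice($values, $significantCount);

    // Exact part: any change in a significant value changes this prefix.
    $prefix = strtoupper(substr(md5(implode("\x1f", $significant)), 0, 7));

    // Fuzzy part: tails with the same Soundex key share a code.
    // soundex() yields an empty (falsy) result when there are no letters,
    // which is mapped to the placeholder "0000" here.
    $tail = soundex(implode('', $optional));
    if (!$tail) {
        $tail = '0000';
    }

    return $prefix . '-' . $tail;
}

$unique1 = generate(['ab034', '981kja7261', '381jkfa0', 'vzcvqdx2993883i3ifja8', '0plnmjfys']);
$unique2 = generate(['ab034', '981kja7261', '381jkfa0', 'vzcvqdx2993883i3ifja8']);

// Both IDs share the same prefix; only the part after "-" may differ.
echo $unique1, "\n", $unique2, "\n";
```

With this layout, two inputs that agree on the significant values can be matched by comparing only the part before the dash, while the tail still records whether (and roughly what) extra data was present.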
