ojreadmore
ojreadmore

Reputation: 1475

Generate unique ID from multiple values with fault tolerance

Given some values, I'd like to make a (pretty darn) unique result.

$unique1 = generate(array('ab034', '981kja7261', '381jkfa0', 'vzcvqdx2993883i3ifja8', '0plnmjfys'));
//now $unique1 == "sqef3452y";

I also need something that's pretty close to return the same result. In this case, 20% of the values is missing.

$unique2 = generate(array('ab034', '981kja7261', '381jkfa0', 'vzcvqdx2993883i3ifja8'));
//also $unique2 == "sqef3452y";

I'm not sure where to begin with such an algorithm but I have some assumptions.

  1. I assume that the more values given, the more accurate the resulting ID – in other words, using 20 values is better than 5.
  2. I also assume that a confidence factor can be calculated and adjusted.

What would be nice to have is a weight factor where one can say 'value 1 is more important than value 3'. This would require a multidimensional array for input instead of one dimension.

I just mashed on the keyboard for these values, but in practice they may be short or long alpha numeric values.

Upvotes: 1

Views: 2368

Answers (2)

p.marino
p.marino

Reputation: 6252

Your two requirements seem a bit contradictory. If the last 20% of the array is non-significant (i.e. you want to get the same result if it is equal '0plnmjfys' or it is null) then why do you want to include it in the first place?

First step is to clarify what you want to disambiguate on. If it is not significant, just drop it.

Once you have decided this, you have to ask yourself if you expect two "close" results to have "close" IDs... i.e. maybe you want

$unique1 = generate(array('ab034', '981kja7261', '381jkfa0', 'vzcvqdx2993883i3ifja8', '0plnmjfys'));
//now $unique1 == "sqef3452y";

$unique1 = generate(array('ab034', '981kja7261', '381jkfa0', 'vzcvqdx2993883i3ifja8', '0plSsa45'));
//now $unique1 == "sqef3452k";

The latter is trickier, because most unique id generators use hashes (you may want to look these up, too) so two very similar strings can return wildly different results.

If you want to ensure Uniqueness and don't care to have "closeness" in your results, just calculate the hash of the concatenated string, or an hash for each input string and concatenate the hashcodes.

If you want to privilege "closeness" you may calculate hashes for the most relevant parts and apply a Soundex algorithm or something similar for the rest of the less-relevant parts.

Just remember the you have conflicting requirements in this: Unique IDs try very hard to give (wildly) different codes for strings, even if the only difference is one character in a 1000-chars string.

Closeness (this string is "more or less the same" as this second string) tries to do the exact opposite, and will hopefully return the same code for two: quoting wikipedia about the Soundex algorithm:

Using this algorithm, both "Robert" and "Rupert" return the same string "R163" while "Rubin" yields "R150". "Ashcraft" and "Ashcroft" both yield "A261".

So... which is which? Do you think that using hashes for the first 4 elements (in your example) and Soundex for the least significant 20% in your example work?

This would probably result (getting back to your example) in something like:

$unique2 = generate(array('ab034', '981kja7261', '381jkfa0', 'vzcvqdx2993883i3ifja8',));
//now $unique2 == "AB67R45-000000";

$unique1 = generate(array('ab034', '981kja7261', '381jkfa0', 'vzcvqdx2993883i3ifja8', '0plSsa45'));
//now $unique2 == "AB67R45-012000";

Upvotes: 1

DGH
DGH

Reputation: 11549

I suggest you read up on random number generators (RNG), seeds, and degrees of randomness.

In general, most software RNGs use a value called a 'seed' to initialize the algorithm. Thereafter, each random number generated is used as the seed for the next iteration. This means that if you always use the same seed (e.g. 1, or 42) you'll always get the same sequence of "random" numbers. Thus, these types of RNGs are often said to be only 'pseudorandom'. For security's sake, the value of the seed is often chosen using something like the current system time in milliseconds or a hardware randomization device in order to reduce the chances of picking the same seed twice in any reasonable period of time.

What you seem to be proposing is an RNG which can take in multiple strings, possibly with weights, and use some formula to compute a seed. Then, you use your seeded RNG to randomly select characters to make a new string. It's interesting, but unfortunately it won't really end up meaningfully more random than just starting with a numerical seed and an existing RNG as described above. Might be fun as an exercise, though!

http://en.wikipedia.org/wiki/Random_number_generation

You might also google 'random string generator' or some such to find more resources on creating random strings.

Upvotes: 0

Related Questions