Reputation: 8439
I would like to create a function that takes a string and returns a number between 0 and 1. The function should consistently return the same number when given the same string, but other than that the results should have no discernible pattern. The output numbers for any large set of input strings should follow a uniform distribution.
Moreover, I need to generate more than one such function, i.e. when given the string "abc", function A might consistently return 0.593927 while function B consistently returns 0.0162524. I need it to be fast (it's for a numerical simulation), and have reasonably good statistics.
I'm using Python and will settle for answers of the form "here is an easy way to do it using a Python library" or "here is an algorithm that you can implement." If there's no fast way to do it in Python I'll just switch to C instead.
I realise that either of the following two methods will work, but each of them have drawbacks that make me want to look for a more elegant solution.
Store a dictionary
I could just calculate a new random number every time I'm given a new string, and store it in a dictionary to be retrieved if I receive the same string again. However, my application is likely to generate a lot of strings that only appear once, This will eventually result in having to store a very large dictionary in memory. It also makes repeatability more difficult, since even if I use the same seed, I'll generate a different function if I receive the same strings in a different order. For these reasons, it would be much better to consistently compute the random numbers "on the fly".
Use a hash function
I could just call a hash function on the string and then convert the result to a number. The issue of generating multiple functions could be solved by, for example, appending a "seed" string to every input string. However, then I'm stuck with trying to find a hash function with the appropriate speed and statistics. Python's built-in hash is fast but implementation-dependent, and I don't know how good the statistics will be since it's not designed for this type of purpose. On the other hand I could use a secure hash algorithm such as md5, which will have good statistics, but this would be too slow for my application. Hash functions aimed at data storage applications are typically much faster than cryptographically secure ones like md5, but they're designed with the aim of avoiding collisions, rather than producing uniformly distributed output, and these aren't necessarily the same in all cases.
A further note about hash functions
To illustrate the point that avoiding collisions and producing uniform results are different things, consider the following example using Python's built-in hash function:
>>> hash("aaa") % 1000
340
>>> hash("aab") % 1000
343
>>> hash("aac") % 1000
342
>>> hash("aad") % 1000
337
>>> hash("aae") % 1000
336
>>> hash("aaf") % 1000
339
>>> hash("aag") % 1000
338
>>> hash("aah") % 1000
349
>>> hash("aai") % 1000
348
>>> hash("aaj") % 1000
351
>>> hash("aak") % 1000
350
There are no collisions in the above output, but they are also clearly not uniformly distributed, since they are all between 336 and 351, and there is also a definite pattern in the third digit. I realise I could probably get better statistics by doing (hash("aaa")/HASH_MAX)*1000
(assuming I can work out what HASH_MAX
should be), but this should help to illustrate that the requirements for a good hash function are not the same as the requirements for the function I'm looking for.
Some relevant information about the problem
I don't know exactly what the strings are that this algorithm will need to work on, because the strings will be generated by the simulation, but the following are likely to be the case:
They will have a very restricted character set (perhaps just 4 or 5 different symbols).
There will be a lot of unique or rare strings and a few very common ones, of varying length.
There is no upper bound on the lengths of the strings, but short ones are likely to be much more common than long ones. I wouldn't be surprised if I never see one longer than 100 characters, but I don't know for sure. Many of them will just have one to three characters, so it's important that the algorithm is fast for short strings. (But I guess I could use a lookup table for strings less than a certain length.)
Typically, the strings will have large substrings in common - often two strings will differ only by a single character appended to the beginning or end. It's important that the algorithm doesn't tend to give similar output values when the strings are similar.
Upvotes: 3
Views: 785
Reputation: 47000
Lookup3 is reputed to have very good collision properties, which ought to imply uniform distribution of results, and it's also fast. Should be simple to put this in a Python extension.
More generally, if you find a function that does a good job of minimizing hash table collisions and has the speed properties you need, a final conversion from a 32- or 64-bit integer to float is all that's needed. There are many sources on the web and elsewhere of string hashing functions. Check Knuth, for starters.
Addition
One other thing that might be worth trying is to encrypt the string first with a fast 1-1 algorithm like RC4 (not secure, but still close enough to pseudorandom) and then run a trivial hash (h = h + a * c[i] + b) over the cipher text. The RC4 key is the uniquifier.
Upvotes: 1
Reputation: 123501
Use a good random number generator and seed it with the string.
Upvotes: 3
Reputation: 1101
Try to use a Fingerprint such as Rabin fingerprinting.
http://en.wikipedia.org/wiki/Fingerprint_(computing).
If you choose a N-bit finger print you just need to divide the result by 2^N.
Fingerprints are a kind of hash functions which are usually very fast to computer (compare to Cryptographic hash functions like MD5) but not good for cryptographic applications (the key value may be recoverable somehow using its fingerprint)
Upvotes: 1
Reputation: 241901
There's an algorithm in the section on "hashing strings" in the Wikipedia article on universal hashing.
Alternatively, you could just use some built-in hash function; each of your random functions prepends a random (but fixed) prefix to the string before hashing.
Upvotes: 1