Gary Holiday
Gary Holiday

Reputation: 3582

How to create a unique hash that will match any strings permutations

Given a string abcd how can I create a unique hashing method that will hash those 4 characters to match bcad or any other permutation of the letters abcd?

Currently I have this code

long hashString(string a) {
    long hashed = 0;
    for(int i = 0; i < a.length(); i++) { 
        hashed += a[i] * 7; // Timed by a prime to make the hash more unique?
    } 
    return hashed;
}

Now this will not work becasue ad will hash with bc.

I know you can make it more unique by multiplying the position of the letter by the letter itself hashed += a[i] * i but then the string will not hash to its permutations.

Is it possible to create a hash that achieves this?

Edit

Some have suggested sorting the strings before you hash them. Which is a valid answer but the sorting would take O(nlog) time and I am looking for a hash function that runs in O(n) time.

I am looking to do this in O(1) memory.

Upvotes: 5

Views: 1913

Answers (5)

dxiv
dxiv

Reputation: 17658

Create an array of 26 integers, corresponding to letters a-z. Initialize it to 0. Scan the string from beginning to end, and increment the array element corresponding to the current letter. Note that up to this point the algorithm has O(n) time complexity and O(1) space complexity (since the array size is a constant).

Finally, hash the contents of the array using your favorite hash function.

Upvotes: 5

user3386109
user3386109

Reputation: 34839

Synopsis: store a histogram of the letters in the hash value.

Step 1: compute a histogram of the letters (since a histogram uniquely identifies the letters in the string without regard to the order of the letters).

int histogram[26];
for ( int i = 0; i < a.length(); i++ )
    histogram[a[i] - 'a']++;

Step 2: pack the histogram into the hash value. You have several options here. Which option to choose depends on what sort of limitations you can put on the strings.

If you knew that each letter would appear no more than 3 times, then it takes 2 bits to represent the count, so you could create a 52-bit hash that's guaranteed to be unique.

If you're willing to use a 128-bit hash, then you've got 5 bits for 24 letters, and 4 bits for 2 letters (e.g. q and z). The 128-bit hash allows each letter to appear 31 times (15 times for q and z).

But if you want a fixed sized hash, say 16-bit, then you need to pack the histogram into those 16 bits in a way that reduces collisions. The easiest way to do that is to create a 26 byte message (one byte for each entry in the histogram, allowing each letter to appear up to 255 times). Then take the 16-bit CRC of the message, using your favorite CRC generator.

Upvotes: 2

Matt Timmermans
Matt Timmermans

Reputation: 59253

Easiest way to understand: sort the letters in the string, and then hash the resulting string.

Some variations on your original idea also work, like:

long hashString(string a) {
    long hashed = 0;
    for(int i = 0; i < a.length(); i++) {
        long t = a[i] * 16777619;
        hashed += t^(t>>8);
    } 
    return hashed;
}

Upvotes: 3

hugomg
hugomg

Reputation: 69954

The basic thing you can do is sort the strings before applying the hash function. So, to compute the hash of "adbc" or "dcba" you instead compute the hash of "abcd".

If you want to make sure that there are no collisions in your hash function, then the only way is to have the hash result be a string. There are many more strings than there are 32-bit (or 64-bit) integers so collisions are innevitable (collisions are unlikely with a good hash function though).

Upvotes: 4

ryan
ryan

Reputation: 1084

I suppose you need a hash such that two anagrams will hash to the same value. I'd suggest you sort them first and use any of the common hash function such as md5. I write the following code using Scala:

 import java.security.MessageDigest

 def hash(s: String) = {
    MessageDigest.getInstance("MD5").digest(s.sorted.getBytes)
 }

Note in scala:

 scala> "hello".sorted
 res0: String = ehllo

 scala> "cinema".sorted
 res1: String = aceimn 

Upvotes: 2

Related Questions