morningstar

Reputation: 9162

Simple general-purpose hash function for a collection

Feel free to mark this as a duplicate, but most questions I've found so far are too specific or more complex than what I'm looking for. For example, in "What is a good hash function", the accepted answer seems to be oriented toward hashing strings.

I've recently started programming in .NET, and I find it unfortunate that built-in classes lack the ability to do some basic things like check equality and find their hash. I'm sure they have their design reasons for that; no need to defend .NET. I just want to know how to avoid a significant sidetrack when I need to use a collection as a key to a dictionary. I want, for example, two different List objects containing all equal values to map to the same entry in the dictionary. Out of the box, they don't: the default behavior for List is that a List is not equal to anything but itself, so another instance of a list with the same values is a different key.

Implementing Equals is straightforward. It's the hash function that I am unsure of.

Is there just something provided that I can call in my implementation of GetHashCode?

If I have to write it from scratch, what's a really simple but good enough hash algorithm? I could use SHA1 but I think it would be overkill. I could just xor all the hashes of the items, but I think that would have some nasty collision properties. I don't care if computing hashes is blazingly fast, but I don't want my hash table to slow to linear on data sets with some particular distribution. What I'd like is something so simple that I can memorize it. Bonus if you can explain (or link to) why it works.

Upvotes: 3

Views: 1708

Answers (3)

Jim Mischel

Reputation: 134125

Be very careful here. If you create a GetHashCode method for a List<T> (or similar collection), then presumably it'll do something like this:

public override int GetHashCode()
{
    int hash = 13;
    foreach (var t in this)
    {
        // X is an operation (undefined here) that somehow combines
        // the previous hash value and the item's hash value
        hash = hash X t.GetHashCode();
    }
    return hash;
}

(I would suggest something like the Jenkins hash for computing the hash code. Also look into the Wang hash (or bit mixer).)
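As a rough sketch of what X in the loop above could look like, here is a combiner that borrows the mixing steps of Jenkins's one-at-a-time hash. Feeding it whole 32-bit item hash codes instead of single bytes is an adaptation made for illustration, and the names SequenceHash and Of are hypothetical.

```csharp
using System.Collections.Generic;

public static class SequenceHash
{
    // Order-dependent hash of a sequence. Each item's hash code is mixed in
    // with the steps of Jenkins's one-at-a-time hash (adapted: the original
    // consumes one byte at a time, this consumes one 32-bit hash at a time).
    public static int Of<T>(IEnumerable<T> items)
    {
        unchecked
        {
            uint hash = 13;
            foreach (T item in items)
            {
                // Null items contribute 0 so that lists with nulls still hash.
                hash += (uint)(item == null ? 0 : item.GetHashCode());
                hash += hash << 10;
                hash ^= hash >> 6;
            }
            // Final avalanche steps of one-at-a-time.
            hash += hash << 3;
            hash ^= hash >> 11;
            hash += hash << 15;
            return (int)hash;
        }
    }
}
```

A collection's GetHashCode override could then return SequenceHash.Of(this). On newer .NET versions the built-in System.HashCode struct (call Add for each item, then ToHashCode) may be a simpler option, if it is available on your target framework.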

Unless you compute that value once and then cache it, you will end up iterating over all of the items every time GetHashCode is called.

So you've created a GetHashCode and Equals for your collection and you put an instance into a Dictionary. Now you have to be very careful not to change the collection (i.e. don't add or remove any items) or any of the items inside the collection. Otherwise the value of GetHashCode will change, and the dictionary won't work anymore.

I strongly suggest that if you want to use a collection as the key to a dictionary, you make very sure that the collection is immutable.
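One way to get both the caching and the immutability is a small wrapper type that copies the items once and computes the hash in its constructor. This is only a sketch: SequenceKey is a hypothetical name, the multiply-and-add combiner is a stand-in for whatever mixing function you choose, and it assumes the items themselves are never modified after the key is built.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical immutable key type: it copies the items once, so later changes
// to the source collection cannot alter the key's hash code.
public sealed class SequenceKey<T> : IEquatable<SequenceKey<T>>
{
    private readonly T[] _items;
    private readonly int _hash;   // computed once, in the constructor

    public SequenceKey(IEnumerable<T> items)
    {
        _items = items.ToArray();
        unchecked
        {
            int hash = 13;
            foreach (T item in _items)
            {
                // Simple multiply-and-add combiner, used only as a stand-in.
                hash = hash * 31 + (item == null ? 0 : item.GetHashCode());
            }
            _hash = hash;
        }
    }

    // Order-sensitive value equality over the copied items.
    public bool Equals(SequenceKey<T> other) =>
        other != null && _items.SequenceEqual(other._items);

    public override bool Equals(object obj) => Equals(obj as SequenceKey<T>);

    // No iteration here: the cached value is returned.
    public override int GetHashCode() => _hash;
}
```

With this, a Dictionary<SequenceKey<int>, string> finds an entry from any key built over equal items in the same order, and adding to the original list afterwards does not disturb the stored key.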

One other thing to consider: the concept of list equality isn't as simple as you indicate. For example, are the lists [1, 2, 3, 4, 5] and [5, 1, 3, 4, 2] equal? It depends on your definition of equality. Certainly A.Union(B) and A.Intersect(B) contain the same items, which means that the lists are equal if your definition of equality is "contain the same items." But if order matters, then the lists aren't equal.

If your definition is "contain the same items," then the hash code calculation I showed above isn't going to work, because it is order dependent: the same items in a different order produce a different hash code. So if you wanted to compute matching hash codes for those lists, you'd have to sort them first.

If the lists cannot contain duplicates, then computing equality is a matter of creating a hash set of one list and looking up each item from the other list in that hash set. If the lists can contain duplicates, then you either have to sort them to determine equality, or use some kind of dictionary with a count. And both of those imply that the objects contained in the list will implement some form of equality comparer, etc.
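For the "same items, duplicates counted, order ignored" definition, a comparer along these lines might do. It is a sketch under a few assumptions: BagComparer is a hypothetical name, items are non-null and implement equality and GetHashCode consistently, and the hash sorts the item hash codes (rather than the items) so that order does not matter.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical comparer that treats lists as multisets: order is ignored,
// but the number of occurrences of each item must match.
// Assumes the lists contain no null items.
public sealed class BagComparer<T> : IEqualityComparer<IReadOnlyList<T>>
{
    public bool Equals(IReadOnlyList<T> x, IReadOnlyList<T> y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null || x.Count != y.Count) return false;

        // Count occurrences in x, then consume them with the items of y.
        var counts = new Dictionary<T, int>();
        foreach (T item in x)
        {
            counts.TryGetValue(item, out int n);
            counts[item] = n + 1;
        }
        foreach (T item in y)
        {
            if (!counts.TryGetValue(item, out int n) || n == 0) return false;
            counts[item] = n - 1;
        }
        return true;
    }

    public int GetHashCode(IReadOnlyList<T> list)
    {
        unchecked
        {
            int hash = 13;
            // Sorting the item hash codes makes the result order independent.
            foreach (int code in list.Select(i => i.GetHashCode()).OrderBy(c => c))
                hash = hash * 31 + code;
            return hash;
        }
    }
}
```

Passing new BagComparer<int>() to the Dictionary constructor would make [1, 2, 3, 4, 5] and [5, 1, 3, 4, 2] map to the same entry, while [1, 2, 2] and [1, 1, 2] stay distinct.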

And some definitions of equality don't take duplicates into account at all. That is, [1, 2, 3] would be equal to [3, 3, 3, 2, 1, 1].
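If that set-style definition is the one you want, the framework already covers it for HashSet<T>: HashSet<T>.SetEquals compares contents, and HashSet<T>.CreateSetComparer() returns a comparer usable as a dictionary key comparer. The wrapper class and method names below are hypothetical and only there to keep the sketch self-contained.

```csharp
using System.Collections.Generic;

public static class SetKeys
{
    // True when both sequences contain the same distinct items,
    // ignoring order and duplicates.
    public static bool SameItems<T>(IEnumerable<T> a, IEnumerable<T> b) =>
        new HashSet<T>(a).SetEquals(b);

    // A dictionary keyed by sets, using the framework's set comparer.
    public static Dictionary<HashSet<int>, string> CreateDictionary() =>
        new Dictionary<HashSet<int>, string>(HashSet<int>.CreateSetComparer());
}
```

Under this definition, SetKeys.SameItems(new[] { 1, 2, 3 }, new[] { 3, 3, 3, 2, 1, 1 }) is true, and two HashSet<int> instances built from those arrays address the same dictionary entry. The earlier warning still applies: don't modify a set while it is in use as a key.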

Considering the varying definitions of equality and the effort it would have taken to allow for those and more in defining the behavior of List<T>, I can understand why whoever designed the collection classes didn't implement value equality. Especially considering that it's pretty uncommon to use a List<T> or similar collection as the key in a dictionary or hash table.

Upvotes: 3

Simon

Reputation: 10841

A good hash function will work equally well for any string of bits - not just characters. However, since a collection may:

  1. Not necessarily be in a contiguous block of memory, and
  2. Include portions that you wouldn't want to include in the hash (e.g. pointers from one element of a linked list to another, which would be different for different linked lists that have the same content but which, for this case, you would want to have the same hash value).

... it seems to me that the key question here may be "what is the best way to combine a set of individual hash values to produce a hash value for a collection?".

XORing the hash values of the individual elements in the collection would be a reasonable approach, in my view. The only problem I can immediately see is that it would lead to two collections with the same elements, but included in different orders, hashing to the same value. An algorithm to avoid this problem could look like this:

  1. Find the hash values of the items in the collection.
  2. Create a bitstring by concatenating those hash values in the order the elements appear in the collection.
  3. Use any reasonable hashing algorithm to generate a hash value for that bitstring of hash values.
  4. Use the hash value calculated in the last step as the hash value for the collection.
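The four steps above could be sketched roughly as follows. The choice of 32-bit FNV-1a for step 3 is an assumption made for illustration (any reasonable byte-string hash would do), and ConcatHash is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;

public static class ConcatHash
{
    public static int Of<T>(IEnumerable<T> items)
    {
        // Steps 1 and 2: take each item's hash code and concatenate the
        // hash codes, in collection order, as raw bytes.
        var bytes = new List<byte>();
        foreach (T item in items)
            bytes.AddRange(BitConverter.GetBytes(item == null ? 0 : item.GetHashCode()));

        // Step 3: hash the resulting byte string (FNV-1a, 32-bit; assumed choice).
        unchecked
        {
            uint hash = 2166136261;      // FNV offset basis
            foreach (byte b in bytes)
            {
                hash ^= b;
                hash *= 16777619;        // FNV prime
            }
            // Step 4: that value is the collection's hash.
            return (int)hash;
        }
    }
}
```

Materializing the byte list keeps the sketch close to the steps as written; the same result can be computed by feeding each item's four bytes straight into the loop without the intermediate list.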

Upvotes: 0

StilesCrisis

Reputation: 16310

In my experience, if you have a collection of things and you want to compute their hash, it is best to compute the hash of each individual object separately and collect all of those hash values into an array. Finally, compute the hash of that array of hash values.

All of the simpler techniques break down relatively quickly. (Like XORing the values together or multiplying by magic numbers and summing--these have all sorts of pathological failure cases.) The one extra array hash you compute at the end is a small cost and pays off overall.
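To make the XOR failure cases concrete, here is a minimal sketch of the naive combiner (XorHash is a hypothetical name). It relies on the fact that an int's hash code in .NET is the int itself.

```csharp
using System.Collections.Generic;

public static class XorHash
{
    // Naive combiner: XOR of all item hash codes.
    public static int Of<T>(IEnumerable<T> items)
    {
        int hash = 0;
        foreach (T item in items)
            hash ^= item == null ? 0 : item.GetHashCode();
        return hash;
    }
}
```

With this combiner, [1, 2, 3] and [3, 2, 1] collide because XOR ignores order, and any item that appears twice cancels itself out, so [5, 5], [9, 9] and the empty list all hash to 0. Since 1 ^ 2 ^ 3 is also 0, all five of those lists land in the same bucket.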

Upvotes: 2
