Guybrush
Guybrush

Reputation: 159

Hashing Sequences of Integers

I have to deal with sequences of numbers, where a sequence has the following properties:

Given a sequence, I'd like to know if this sequence has already occurred, that is I want to hash sequences. For example,

[2, 3, 6, 2, 13]

and

[6, 3, 2, 13, 2]

should have the same hash values.

The programming language being used is C.

I know that I could first sort the sequences and then store them in a trie, which is definitely an option. Nevertheless, what would be an appropriate hash function for this purpose?

Upvotes: 7

Views: 6141

Answers (3)

Frank Schwieterman
Frank Schwieterman

Reputation: 24480

Putting perf aside, if you want something easy to review and reliable, I'd go with:

var sample = new [] { 11, 55, 12, 3 };
String.Join(" ", sample.OrderBy(i => i)).GetHashCode()

You may want to add a call to .Distinct() depending on if you're considering a set instead of a sequence.

Upvotes: 0

Ilmari Karonen
Ilmari Karonen

Reputation: 50328

The requirement that

  • the order of elements does not matter

makes me immediately think of something like Zobrist hashing. That is, you'd have a function f mapping integers to random bitstrings, and your hash would be simply the XOR of the bitstrings corresponding to the numbers in your sequence.

Of course, the basic Zobrist hashing described above doesn't satisfy your other requirement that

  • multiple occurrences of elements is allowed

since the XOR operation is its own inverse (i.e. a XOR a = 0 for any a). However, simply replacing XOR with some other ring operation without this property (which, in normal Zobrist hashing, is actually considered desirable), such as n-bit addition, should produce a hash like you want:

unsigned int hash_multiset (int *seq, int n) {
    unsigned int h = 0;
    while (n--) h += f( *seq++ );
    return h;
}

(A minor detail to note about this function is that, if you want to truncate its output, it's slightly better to use the upper than the lower bits. This is because, if the k lowest bits of the hashes of the sequences [a] and [b] collide, then so will the k lowest bits of [a, a], [b, b], [a, b] and so on. For the k highest bits, this is not true, since the lower bits can carry over into the higher ones, producing more "random-looking" output.)

There are various ways to implement the function f. For a limited range of input integers, you could simply use a fixed lookup table of random bitstrings. Alternatively, if you don't know the range of your inputs in advance, you could use another (ordinary) hash table mapping integers to random bitstrings and just build it up "on the fly".

Finally, it's also possibly to implement f without a lookup table, simply by using a fixed function that "looks random enough". One good choice for such a function would be to use a simple and fast block cipher, such as TEA or (on systems with hardware support for it) AES, with the output truncated to your preferred hash length.

Upvotes: 4

Stig Brautaset
Stig Brautaset

Reputation: 2632

How about multiplying all the numbers and the length of the sequence, modulo some reasonably big number? Here's some Scala code that shows the calculation:

val l = List(6, 3, 2, 13, 2)
(l.reduce(_ * _) * l.length) % 10000

This results in: 4680.

Obviously this does not guarantee that if the hashes match, the sequences are unique. (It might not even be a very good approximation!) However, if the hashes do not match, it's guaranteed that the sequences are not the same.

Upvotes: 1

Related Questions