Arek' Fu

Reputation: 857

Hashing an unordered sequence of small integers

Background

I have a large collection (~thousands) of sequences of integers. Each sequence has the following properties:

  1. it is of length 12;
  2. the order of the sequence elements does not matter;
  3. no element appears twice in the same sequence;
  4. all elements are smaller than about 300.

Note that properties 2 and 3 imply that the sequences are actually sets, but they are stored as C arrays to maximise access speed.

I'm looking for a good C++ algorithm to check if a new sequence is already present in the collection. If not, the new sequence is added to the collection. I thought about using a hash table (note however that I cannot use any C++11 constructs or external libraries, e.g. Boost). Hashing the sequences and storing the values in a std::set is also an option, since collisions can be just neglected if they are sufficiently rare. Any other suggestion is also welcome.

Question

I need a commutative hash function, i.e. a function that does not depend on the order of the elements in the sequence. I thought about first reducing the sequences to some canonical form (e.g. sorting) and then using standard hash functions (see refs. below), but I would prefer to avoid the overhead associated with copying (I can't modify the original sequences) and sorting. As far as I can tell, none of the functions referenced below are commutative. Ideally, the hash function should also take advantage of the fact that elements never repeat. Speed is crucial.

Any suggestions?

Upvotes: 13

Views: 5069

Answers (5)

Arek' Fu

Reputation: 857

I accepted Jim Balter's answer because he's the one who came closest to what I eventually coded, but all of the answers got my +1 for their helpfulness.

Here is the algorithm I ended up with. I wrote a small Python script that generates 300 64-bit integers such that their binary representation contains exactly 32 true and 32 false bits. The positions of the true bits are randomly distributed.

import random

def random_combination(iterable, r):
    "Random selection from itertools.combinations(iterable, r)"
    pool = tuple(iterable)
    n = len(pool)
    indices = sorted(random.sample(range(n), r))
    return tuple(pool[i] for i in indices)

mask_size = 64
mask_size_over_2 = mask_size // 2  # exactly half the bits are set

nmasks = 300

suffix = 'UL'

# Emit a C++ array initializer with one random balanced mask per value.
print('HashType mask[' + str(nmasks) + '] = {')
for i in range(nmasks):
    combo = random_combination(range(mask_size), mask_size_over_2)
    mask = 0
    for j in combo:
        mask |= 1 << j
    if i < nmasks - 1:
        print('\t' + str(mask) + suffix + ',')
    else:
        print('\t' + str(mask) + suffix + ' };')

The C++ array generated by the script is used as follows:

#include <cstdint>
#include <numeric>

typedef uint64_t HashType;  // unsigned: the masks are raw bit patterns

const int maxTableSize = 300;

HashType mask[maxTableSize] = {
  // generated array goes here
};

// XOR-accumulate: each element selects its precomputed random mask.
inline HashType xorrer(HashType const &l, HashType const &r) {
  return l ^ mask[r];
}

// Commutative hash: XOR is order-independent, and since no element
// repeats, no mask can cancel itself out.
HashType hashConfig(HashType *sequence, int n) {
  return std::accumulate(sequence, sequence + n, (HashType)0, xorrer);
}

This algorithm is by far the fastest of those that I have tried (the other approaches suggested in the answers below: sorting plus hash_combine, raising to powers and summing, and the bitset of size 300). For my "typical" sequences of integers, collision rates are smaller than 1E-7, which is completely acceptable for my purpose.
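Putting this together with the std::set idea from the question gives a minimal self-contained sketch. The four-entry mask table and the `addIfNew` helper are mine, stand-ins for the generated 300-entry array:

```cpp
#include <cstdint>
#include <numeric>
#include <set>

typedef uint64_t HashType;  // unsigned: the masks are raw bit patterns

// Tiny stand-in table; the real one holds the 300 generated masks.
static const HashType mask[4] = {
    0xF0F0F0F0F0F0F0F0ULL, 0x0F0F0F0F0F0F0F0FULL,
    0x3333333333333333ULL, 0xCCCCCCCCCCCCCCCCULL
};

inline HashType xorrer(HashType l, HashType r) { return l ^ mask[r]; }

HashType hashConfig(const HashType *sequence, int n) {
    return std::accumulate(sequence, sequence + n, HashType(0), xorrer);
}

// Insert the hash of a sequence; returns false if an equal-hash
// sequence was already seen (i.e. a duplicate, up to collisions).
bool addIfNew(std::set<HashType> &seen, const HashType *sequence, int n) {
    return seen.insert(hashConfig(sequence, n)).second;
}
```

Because XOR is commutative, a permuted copy of a stored sequence hashes to the same value and is rejected as a duplicate.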

Upvotes: 2

Evgeny Kluev

Reputation: 24647

You could toggle the bits corresponding to each of the 12 integers in a bitset of size 300. Then use the formula from boost::hash_combine to combine the ten 32-bit integers that implement this bitset.

This gives a commutative hash function, does not use sorting, and takes advantage of the fact that elements never repeat.
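A minimal sketch of this scheme (the function name and word layout are my own; ten 32-bit words give 320 bits, enough to cover values below 300):

```cpp
#include <cstdint>
#include <cstddef>

// Toggle one bit per element, then fold the ten words together
// with a boost::hash_combine-style formula.
std::size_t bitset_hash(const int *seq, int n) {
    uint32_t bits[10] = {0};  // 10 x 32 = 320 bits >= 300
    for (int i = 0; i < n; ++i)
        bits[seq[i] / 32] ^= 1u << (seq[i] % 32);  // toggle: order irrelevant

    std::size_t h = 0;
    for (int w = 0; w < 10; ++w)
        h ^= bits[w] + 0x9e3779b9 + (h << 6) + (h >> 2);
    return h;
}
```

Toggling (XOR) rather than setting (OR) is what makes the non-repetition property matter: with no duplicates, no bit can be toggled back off.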


This approach can be generalized by choosing an arbitrary bitset size and by setting or toggling an arbitrary number of bits for each of the 12 integers (which bits to set/toggle for each of the 300 values is determined either by a hash function or by a pre-computed lookup table). The result is a Bloom filter or a related structure.

We can choose a Bloom filter of size 32 or 64 bits. In this case, there is no need to combine pieces of a large bit vector into a single hash value. For a classical Bloom filter of size 32, the optimal number of hash functions (or non-zero bits for each value of the lookup table) is 2.

If, instead of the "or" operation of the classical Bloom filter, we choose "xor" and use half non-zero bits for each value of the lookup table, we get the solution mentioned by Jim Balter.

If, instead of "or", we choose "+" and use approximately half non-zero bits for each value of the lookup table, we get a solution similar to the one suggested by Konrad Rudolph.

Upvotes: 4

Jim Balter

Reputation: 16406

Sort the elements of your sequences numerically and then store the sequences in a trie. Each level of the trie is a data structure in which you search for the element at that level ... you can use different data structures depending on how many elements are in it ... e.g., a linked list, a binary search tree, or a sorted vector.
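For illustration, the trie idea might look like this, a sketch using one std::map per level (the names are mine; the answer suggests varying the per-level structure by element count, and nodes are leaked here for brevity):

```cpp
#include <map>
#include <utility>

// One trie level per sequence position; children keyed by element value.
struct TrieNode {
    std::map<int, TrieNode*> children;
};

// Walk the trie along the sorted sequence, creating nodes as needed.
// Returns true if the sequence was already present.
bool insert_sorted(TrieNode *root, const int *sorted_seq, int n) {
    TrieNode *node = root;
    bool existed = true;
    for (int i = 0; i < n; ++i) {
        std::map<int, TrieNode*>::iterator it =
            node->children.find(sorted_seq[i]);
        if (it == node->children.end()) {
            existed = false;  // new branch: sequence was not present
            it = node->children.insert(
                std::make_pair(sorted_seq[i], new TrieNode())).first;
        }
        node = it->second;
    }
    return existed;
}
```

Unlike the hashing approaches, this gives exact membership with no collisions at all, at the cost of pointer-chasing per level.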

If you want to use a hash table rather than a trie, then you can still sort the elements numerically and then apply one of those non-commutative hash functions. You need to sort the elements in order to compare the sequences, which you must do because you will have hash table collisions. If you didn't need to sort, then you could multiply each element by a constant factor that would smear them across the bits of an int (there's theory for finding such a factor, but you can find it experimentally), and then XOR the results. Or you could look up your ~300 values in a table, mapping them to unique values that mix well via XOR (each one could be a random value chosen so that it has an equal number of 0 and 1 bits -- each XOR flips a random half of the bits, which is optimal).
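A sketch of the multiply-then-XOR variant described above. The multiplier is an illustrative pick (the 64-bit golden-ratio constant), not the experimentally tuned factor the answer alludes to:

```cpp
#include <cstdint>

// Multiply each element by a constant to smear it across the word,
// then XOR the products. XOR makes the result order-independent,
// and since no element repeats, no product cancels itself out.
uint64_t smear_hash(const int *seq, int n) {
    uint64_t h = 0;
    for (int i = 0; i < n; ++i)
        h ^= (uint64_t)seq[i] * 0x9E3779B97F4A7C15ULL;
    return h;
}
```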

Upvotes: 4

Konrad Rudolph

Reputation: 545558

I would just use the sum as the hash and see how far you get with that. This doesn’t take advantage of the non-repeating property of the data, nor of the fact that all elements are < 300. On the other hand, it’s blazingly fast.

#include <numeric>

std::size_t hash(int (&arr)[12]) {
    return std::accumulate(arr, arr + 12, 0);
}

Since the function needs to be unaware of ordering, I don’t see a smart way of taking advantage of the limited range of the input values without first sorting them. If this is absolutely required, collision-wise, I’d hard-code a sorting network (i.e. a number of if/else statements) to sort the 12 values in place (but I have no idea what a sorting network for 12 values would look like, or even whether it’s practical).
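For what it's worth, here is what a sorting network looks like for 4 values; it uses 5 compare-and-swap steps, while the best known network for 12 inputs needs 39:

```cpp
#include <algorithm>

// One comparator of the network: order a pair in place.
inline void cswap(int &a, int &b) { if (a > b) std::swap(a, b); }

// Optimal 5-comparator sorting network for 4 values.
void sort4(int v[4]) {
    cswap(v[0], v[1]); cswap(v[2], v[3]);  // sort both halves
    cswap(v[0], v[2]); cswap(v[1], v[3]);  // merge extremes
    cswap(v[1], v[2]);                     // fix the middle pair
}
```

The comparator sequence is fixed and branch-light, which is exactly why sorting networks are attractive for tiny fixed-size arrays like these.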

EDIT After the discussion in the comments, here’s a very nice way of reducing collisions: raise every value in the array to some integer power before summing. The easiest way of doing this is via transform. This does generate a copy but that’s probably still very fast:

#include <algorithm>
#include <numeric>

// Function object usable with std::transform (no C++11 lambdas required).
struct pow2 {
    int operator ()(int n) const { return n * n; }
};

std::size_t hash(int (&arr)[12]) {
    int raised[12];
    std::transform(arr, arr + 12, raised, pow2());
    return std::accumulate(raised, raised + 12, 0);
}

Upvotes: 4

Kerrek SB

Reputation: 476990

Here's a basic idea; feel free to modify it at will.

  1. Hashing an integer is just the identity.

  2. We use the formula from boost::hash_combine to combine hashes.

  3. We sort the array to get a unique representative.

Code:

#include <algorithm>
#include <cstddef>

std::size_t array_hash(int (&array)[12])
{
    int a[12];
    std::copy(array, array + 12, a);
    std::sort(a, a + 12);

    std::size_t result = 0;

    for (int * p = a; p != a + 12; ++p)
    {
        std::size_t const h = *p; // the "identity hash"

        result ^= h + 0x9e3779b9 + (result << 6) + (result >> 2);
    }

    return result;
}

Update: scratch that. You just edited the question to be something completely different.

If every number is at most 300, then you can squeeze the sorted array into 9 bits per element, i.e. 108 bits in total. The "unordered" property only saves you a factor of 12!, which is about 29 bits, so it doesn't really make a difference.

You can either look for a 128-bit unsigned integral type and store the sorted, packed set of integers in that directly, or you can split that range into two 64-bit integers and compute the hash as above:

uint64_t hash = lower_part + 0x9e3779b9 + (upper_part << 6) + (upper_part >> 2);

(Or maybe use 0x9E3779B97F4A7C15 as the magic number, which is the 64-bit version.)
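A sketch of that packing (the helper name is mine): six 9-bit values fit comfortably in each 64-bit half, and the two halves are folded with the 64-bit constant from the answer.

```cpp
#include <algorithm>
#include <cstdint>

// Pack the 12 sorted 9-bit values into two 64-bit halves (6 * 9 = 54
// bits each), then combine the halves into a single hash.
uint64_t packed_hash(const int *arr) {
    int a[12];
    std::copy(arr, arr + 12, a);
    std::sort(a, a + 12);  // canonical order: makes the hash commutative

    uint64_t lower = 0, upper = 0;
    for (int i = 0; i < 6; ++i)  lower = (lower << 9) | (uint64_t)a[i];
    for (int i = 6; i < 12; ++i) upper = (upper << 9) | (uint64_t)a[i];

    return lower + 0x9E3779B97F4A7C15ULL + (upper << 6) + (upper >> 2);
}
```

Since the packed 108-bit representation is unique per set, any collision can only come from folding the two halves, not from the packing itself.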

Upvotes: 6
