Reputation: 11

Semi-reversible integer hash (please keep an open mind)

I need to explore the topic of integer hashes for a specific application. I have a few requirements:

integer to integer hash
"semi" reversibility. I know a hash will not be 1-1 reversible, so please try to understand what I have in mind in terms of an n-1 hash. Let's say I have an original domain of numbers 0...n that I hash into a smaller domain 0...k. If I hash using function f(n) = k, then I want something reversible in the sense that I also have an "inverse" g(k) = {n1,n2,n3, ..., nj} are all the possible domain members that hash to k
reasonably even and "randomish" distribution
for my "inverse" function g, I have a tight bound on the size of the set returned, and this size is roughly the same for any given k
fast integer hash

To explain the application a bit here... I am operating in a very memory restricted environment. I intend to not allow collisions. That is, if there is a collision with an existing value in the table, the insert operation just fails. That's ok. I don't need every insert to succeed. I am ready to make that trade off in favor of space and speed. Now the key thing is this, when I store the value in the table I need to absolutely minimize the number of bits represented. What I am hoping for is basically:

If I hash to value k, I can immediately narrow down what I store to a small subset of the original domain. If the hash is "semi reversible" and if I can enumerate all the possible domain elements hashing to k, then I can order them and assign the ordinal to each possibility. Then I would like to store that much smaller ordinal rather than the original value which will require hopefully many fewer bits. Then I should be able to fully reverse this by enumerating to the ith possibility for stored ordinal i.

The importance of the tight bound on the size of he inverse set g(k) is because I need to know how many bits I need to allocate for each ordinal, and I want to keep things relatively simple by allocating the same number of bits to each table entry. Yes. I will probably be working on smaller than a byte values though. The original domain will be of a relatively small size to start with.

I'm interested in any of your thoughts and any examples anyone might have reference to. I think this should be doable, but I would like to get an idea of the range of possible solutions.

Thanks in advance! Mark

Upvotes: 1

Answers (2)

MvG

Reputation: 60888

Shuffle for desired distribution

Apply some bijection in the 0..(n-1) domain to shuffle things a bit. This would be particularly easy if n were a prime number, since in that case you could treat modulo arithmetic as a field, and perform all kinds of nice mathematical functions. One thing which might distribute numbers evenly enough for your needs might be multiplication by a fixed number c, followed by modulo:

a ↦ (c*a) mod n

You'll have to choose c such that it is coprime to n, i.e. gcd(c,n)=1. If n is a prime number, then this is trivial as long as c≠0, and if n were a power of two, then any odd number would still suffice. This coprimality condition ensures the existence of another number d which is the inverse of c, i.e. it satisfies c*d ≡ 1 (mod n) so that multiplication by d will undo the effect of multiplication by c. You might e.g. use BigInteger.modInverse in Java or Wolfram Alpha to compute this number.

If your n is a power of two, then you can avoid the modulo operation (and the time that would take), and instead do simple bit mask operations. But even for other values of n, you can sometimes come up with schemes that avoid a generic division operation. When you choose c (and d with it), you can do so in a way that both c and d have only few non-zero bits. Then multiplication can likely be expressed in terms of bit shifts and additions. Your optimizing compiler should take care of that for you, as long as you make sure these numbers are compile-time constants.

Here is an example which makes this optimization explicit. Note that writing code this way should not be neccessary: usually it should be enough to write things like (25*a)&1023.

// n = 1024
// c = 25 = 16+8+1
// d = 41 = 32+8+1

static unsigned shuffle(unsigned a) {
  return (a + (a << 3) + (a << 4)) & 1023;
}
static unsigned unshuffle(unsigned a) {
  return (a + (a << 3) + (a << 5)) & 1023;
}

Another shuffling approach which would work for the case that n is a power of two is using some combinations of bit shifts, masks and xors to modify the value. This could be combined with the above multiplication approach, either doing bit twiddling before or after the multiplication, or even both. Making a choice depends very much on the actual distribution of values.

Split and store

The resulting value, still in the range 0..(n-1), can be split into two values: one part which is in the range 0..(k-1) and will be called lo, and another in the range 0..(ceil(n/k)-1) which I'll call hi.

lo = a mod k
hi = floor(a/k)

If k is a power of two, you can obtain lo using a bit mask, and hi using a bit shift. You could then use hi to denote a hash bucket, and lo to signify a value to store in that bucket. All values with the same hi value would collide, but their lo part would help retrieving the value actually stored.

If you want to recognize unoccupied slots of your hash map, then you should ensure that one specific lo value (e.g. zero) will be reserved for this purpose in every slot. If you cannot achieve this reservation in the original set of values, then you might want to choose k as a power of two minus one, so that you can store the value of k itself to denote empty cells. Or you could swap the meaning of hi and lo, such that you could tune the value of n to leave out some values. I'll use this in the example below.

Inversion

To invert this whole thing, you take the key hi and the stored value lo, combine them to a value a=k*hi+lo in the range 0..(n-1), then undo the initial shuffling to go back to your original value.

Example

This example is geared to avoid all multiplication and division. It distributes n=4032 values over k=64 slots, with n/k=63 different values plus one special empty value possible for each slot. It does shuffling using c=577 and d=1153.

unsigned char bitseq[50] = { 0 };

int store(unsigned a) {
  unsigned b, lo, hi, bitpos, byteno, cur;
  assert(a < 4032);                        // a has range 0 .. 0xfbf

  // shuffle
  b = (a << 9) + (a << 6) + a + 64;        // range 0x40 ..0x237dbf
  b = (b & 0xfff) + ((b & 0xfff000) >> 6); // range 0x40 .. 0x9d7f
  b = (b & 0xfff) + ((b & 0xfff000) >> 6); // range 0x40 .. 0x11ff
  b = (b & 0xfff) + ((b & 0xfff000) >> 6); // range 0x40 .. 0xfff
  b -= 64;                                 // range 0x00 .. 0xfbf

  // split
  lo = b & 63;                             // range 0x00 .. 0x3f
  hi = b >> 6;                             // range 0x00 .. 0x3e

  // access bit sequence
  bitpos = (lo << 2) + (lo << 1);          // range 0x00 .. 0x17a
  byteno = (bitpos >> 3);                  // range 0x00 .. 0x30
  bitpos &= 7;                             // range 0x00 .. 0x7
  cur = (((bitseq[byteno + 1] << 8) | bitseq[byteno]) >> bitpos) & 0xff;
  if (cur != 0) return 1; // slot already occupied.
  cur = hi + 1;           // range 0x01 .. 0x3f means occupied
  bitseq[byteno] |= (cur << bitpos) & 0xff;
  bitseq[byteno + 1] |= ((cur << bitpos) & 0xff00) >> 8;
  return 0;               // slot was free, value stored
}

void list_all() {
  unsigned b, lo, hi, bitpos, byteno, cur;
  for (lo = 0; lo != 64; ++lo) {
    // access bit sequence
    bitpos = (lo << 2) + (lo << 1);
    byteno = (bitpos >> 3);
    bitpos &= 7;
    cur = (((bitseq[byteno + 1] << 8) | bitseq[byteno]) >> bitpos) & 0x3f;
    if (cur == 0) continue;

    // recombine
    hi = cur - 1;
    b = (hi << 6) | lo;

    // unshuffle
    b = (b << 10) + (b << 7) + b + 64;
    b = (b & 0xfff) + ((b & 0xfff000) >> 6);
    b = (b & 0xfff) + ((b & 0xfff000) >> 6);
    b = (b & 0xfff) + ((b & 0xfff000) >> 6);
    b -= 64;

    // report
    printf("%4d was stored in slot %2d using value %2d.\n", b, lo, cur);
  }
}

As you can see, it is possible to avoid all multiplication and division operations, and all explicit modulo calls as well. Whether the resulting code has more performance than one using a single modulo call per invocation remains to be tested. The fact that you need up to three reduction steps to avoid a single modulo makes this rather costly.

You can watch a demo run of the above code.

Upvotes: 2

Klas Lindbäck

Reputation: 33273

There is no such thing as a free lunch.

If you have an even distribution then g(k1) will have n/k values for each k1. So you end up having to store k*n/k or n values, which happens to be the same number you started with.

You should probably be looking for compression algorithms rather than hash functions. It will improve your google charma.

That said, it is hard to suggest a compression algorithm without knowing the distribution of numbers. If it is truly random, then it will be hard to compress.

Upvotes: 0