Pacane
Pacane

Reputation: 21461

Hash function for floats

I'm currently implementing a hash table in C++ and I'm trying to make a hash function for floats...

I was going to treat floats as integers by padding the decimal numbers, but then I realized that I would probably reach the overflow with big numbers...

Is there a good way to hash floats?

You don't have to give me the function directly, but I'd like to see/understand different concepts...

Notes:

  1. I don't need it to be really fast, just evenly distributed if possible.

  2. I've read that floats should not be hashed because of the speed of computation, can someone confirm/explain this and give me other reasons why floats should not be hashed? I don't really understand why (besides the speed)

Upvotes: 27

Views: 37279

Answers (8)

0kcats
0kcats

Reputation: 745

I believe it is important to remember the following. When you implement a hash function to be used in a hash table, you must also provide the equality operator to go together, and they must be in sync.

Suppose you have implemented a hash function like this:

uint32_t hash(float f);

and provided an equality operator too:

bool equal(float f1, float f2);

You need to ensure that they behave like this:

if hash(v1) != hash(v2) then equal(v1, v2) == false and also

if equal(v1, v2) == true then hash(v1) == hash(v2).

As mentioned in other posts, floating point types have special values that mess up the hashing, such as -0, +0, quiet NAN, signaling NAN. These have different binary representation. If you where to choose a hash function that casts bits of float to an 32 bit int, then you cannot use only regular boolean operator == to implement equal(), because std::bit_cast<float, uint32_t>(-0) != std::bit_cast<float, uint32_t>(+0), but -0 == +0. At the same time std::bit_cast<float, uint32_t>(QNAN) == std::bit_cast<float, uint32_t>(QNAN), while QNAN == QNAN is false.

Usually we don't want to distinguish +0 and -0, so we need to filter them out before bit_cast, and also we want to account for accidental NAN value getting into out data, then we would have to check if both values are NAN and return true.

Assuming that you have a good hash for uint32_t, say called hash_uint32, the following combination of hash and equals should work well together:

uint32_t hash(float f) 
{ 
  float normalized_f = f != 0 ? f : 0.0;
  return hash_uint32(std::bit_cast<float, uint32_t>(normalized_f));
}

bool equal(float f1, float f1) 
{
  return f1 == f2 || 
    std::bit_cast<float, uint32_t>(f1) == std::bit_cast<float, uint32_t>(f2);
}

These ensure that -0 and +0 are treated as same number and also that strange NAN values would not break your hash table forever.

Upvotes: 2

user502187
user502187

Reputation:

Because of the IEEE byte ordering the Java Float.hashCode() and Double.hashCode() do not give good results. This problem is wellknown and can be adressed by this scrambler:

class HashScrambler {

    /**
     * https://sites.google.com/site/murmurhash/
     */
    static int murmur(int x) {
        x ^= x >> 13;
        x *= 0x5bd1e995;
        return x ^ (x >> 15);
    }

}

You then get a good hash function, which also allows you to use Float and Double in hash tables. But you need to write your own hash table that allows a custom hash function.

Since in a hash table you need also test for equality, you need an exact equality to make it work. Maybe the later is what President James K. Polk intends to adress?

Upvotes: 0

authorized
authorized

Reputation: 51

If you're interested, I just made a hash function that uses floating point and can hash floats. It also passes SMHasher ( which is the main bias-test for non-crypto hash functions ). It's a lot slower than normal non-cryptographic hash functions due to the float calculations.

I'm not sure if tifuhash will become useful for all applications, but it's interesting to see a simple floating point function pass both PractRand and SMHasher.

The main state update function is very simple, and looks like:

function q( state, val, numerator, denominator ) {
  // Continued Fraction mixed with Egyptian fraction "Continued Egyptian Fraction"
  // with denominator = val + pos / state[1]
  state[0] += numerator / denominator;
  state[0] = 1.0 / state[0];

  // Standard Continued Fraction with a_i = val, b_i = (a_i-1) + i + 1
  state[1] += val;
  state[1] = numerator / state[1];
}

Anyway, you can get it on npm Or you can check out the github

Using is simple:

const tifu = require('tifuhash');

const message = 'The medium is the message.';
const number = 333333333;
const float = Math.PI;

console.log( tifu.hash( message ), 
  tifu.hash( number ),
  tifu.hash( float ),
tifu.hash( ) );

There's a demo of some hashes on runkit here https://runkit.com/593a239c56ebfd0012d15fc9/593e4d7014d66100120ecdb9

Side note: I think that in future using floating point,possibly big arrays of floating point calculations, could be a useful way to make more computationally-demanding hash functions in future. A weird side effect I discovered of using floating point is that the hashes are target dependent, and I surmise maybe they could be use to fingerprint the platforms they were calculated on.

Upvotes: 1

ideasman42
ideasman42

Reputation: 47968

You can of course represent a float as an int type of the same size to hash it, however this naive approach has some pitfalls you need to be careful of...

Simply converting to a binary representation is error prone since values which are equal wont necessarily have the same binary representation.

An obvious case: -0.0 wont match 0.0 for example. *

Further, simply converting to an int of the same size wont give very even distribution, which is often important (implementing a hash/set that uses buckets for example).

Suggested steps for implementation:

  • filter out non-finite cases (nan, inf) and (0.0, -0.0 whether you need to do this explicitly or not depends on the method used).
  • convert to an int of the same size
    (that is - use a union for example to represent the float as an int, not simply cast to an int).
  • re-distribute the bits, (intentionally vague here!), this is basically a speed vs quality tradeoff. But if you have many values in a small range you probably don't want them to in a similar range too.

*: You may wan't to check for (nan and -nan) too. How to handle those exactly depends on your use case (you may want to ignore sign for all nan's as CPython does).

Python's _Py_HashDouble is a good reference for how you might hash a float, in production code (ignore the -1 check at the end, since that's a special value for Python).

Upvotes: 7

O_Z
O_Z

Reputation: 1563

You can use the std hash, it's not bad:

 std::size_t myHash = std::cout << std::hash<float>{}(myFloat);

Upvotes: 13

President James K. Polk
President James K. Polk

Reputation: 41958

It depends on the application but most of time floats should not be hashed because hashing is used for fast lookup for exact matches and most floats are the result of calculations that produce a float which is only an approximation to the correct answer. The usually way to check for floating equality is to check if it is within some delta (in absolute value) of the correct answer. This type of check does not lend itself to hashed lookup tables.

EDIT:

Normally, because of rounding errors and inherent limitations of floating point arithmetic, if you expect that floating point numbers a and b should be equal to each other because the math says so, you need to pick some relatively small delta > 0, and then you declare a and b to be equal if abs(a-b) < delta, where abs is the absolute value function. For more detail, see this article.

Here is a small example that demonstrates the problem:

float x = 1.0f;
x = x / 41;
x = x * 41;
if (x != 1.0f)
{
    std::cout << "ooops...\n";
}

Depending on your platform, compiler and optimization levels, this may print ooops... to your screen, meaning that the mathematical equation x / y * y = x does not necessarily hold on your computer.

There are cases where floating point arithmetic produces exact results, e.g. reasonably sized integers and rationals with power-of-2 denominators.

Upvotes: 17

Goz
Goz

Reputation: 62323

If your hash function did the following you'd get some degree of fuzziness on the hash lookup

unsigned int Hash( float f )
{
    unsigned int ui;
    memcpy( &ui, &f, sizeof( float ) );
    return ui & 0xfffff000;
}

This way you'll mask off the 12 least significant bits allowing for a degree of uncertainty ... It really depends on yout application however.

Upvotes: 12

fredoverflow
fredoverflow

Reputation: 263048

unsigned hash(float x)
{
    union
    {
        float f;
        unsigned u;
    };
    f = x;
    return u;
}

Technically undefined behavior, but most compilers support this. Alternative solution:

unsigned hash(float x)
{
    return (unsigned&)x;
}

Both solutions depend on the endianness of your machine, so for example on x86 and SPARC, they will produce different results. If that doesn't bother you, just use one of these solutions.

Upvotes: 6

Related Questions