nakiya
nakiya

Reputation: 14403

Perfect hash function for a set of integers with no updates

In one of the applications I work on, it is necessary to have a function like this:

bool IsInList(int iTest)
{
   //Return if iTest appears in a set of numbers.
}

The number list is known at app load up (But are not always the same between two instances of the same application) and will not change (or added to) throughout the whole of the program. The integers themselves maybe large and have a large range so it is not efficient to have a vector<bool>. Performance is a issue as the function sits in a hot spot. I have heard about Perfect hashing but could not find out any good advice. Any pointers would be helpful. Thanks.

p.s. I'd ideally like if the solution isn't a third party library because I can't use them here. Something simple enough to be understood and manually implemented would be great if it were possible.

Upvotes: 9

Views: 7342

Answers (9)

Glenn Slayden
Glenn Slayden

Reputation: 18749

  1. (Once, at startup:) Ensure the list of integers is sorted (and contiguous).
  2. (As needed:) Test membership using a binary search.

I don't think you'll do much better than this if the ranges of integers in the list are truly large and arbitrary, as you stated.

Upvotes: 0

Steve Massey
Steve Massey

Reputation: 86

A trie or perhaps a van Emde Boas tree might be a better bet for creating a space efficient set of integers with lookup time bring constant against the number of objects in the data structure, assuming that even std::bitset would be too large.

Upvotes: 0

ticktock
ticktock

Reputation: 133

integers, strings, doesn't matter

http://videolectures.net/mit6046jf05_leiserson_lec08/

After the intro, at 49:38, you'll learn how to do this. The Dot Product hash function is demonstrated since it has an elegant proof. Most hash functions are like voodoo black magic. Don't waste time here, find something that is FAST for your datatype and that offers some adjustable SEED for hashing. A good combo there is better than the alternative of growing the hash table.

@54:30 The Prof. draws picture of a standard way of doing perfect hash. The perfect minimal hash is beyond this lecture. (good luck!)

It really all depends on what you mod by.

Keep in mind, the analysis he shows can be further optimized by knowing the hardware you are running on.

The std::map you get very good performance in 99.9% scenarios. If your hot spot has the same iTest(s) multiple times, combine the map result with a temporary hash cache.

Int is one of the datatypes where it is possible to just do:

bool hash[UINT_MAX]; // stackoverflow ;)

And fill it up. If you don't care about negative numbers, then it's twice as easy.

Upvotes: 2

Matthieu M.
Matthieu M.

Reputation: 299730

I would suggest using Bloom Filters in conjunction with a simple std::map.

Unfortunately the bloom filter is not part of the standard library, so you'll have to implement it yourself. However it turns out to be quite a simple structure!

A Bloom Filter is a data structure that is specialized in the question: Is this element part of the set, but does so with an incredibly tight memory requirement, and quite fast too.

The slight catch is that the answer is... special: Is this element part of the set ?

  • No
  • Maybe (with a given probability depending on the properties of the Bloom Filter)

This looks strange until you look at the implementation, and it may require some tuning (there are several properties) to lower the probability but...

What is really interesting for you, is that for all the cases it answers No, you have the guarantee that it isn't part of the set.

As such a Bloom Filter is ideal as a doorman for a Binary Tree or a Hash Map. Carefully tuned it will only let very few false positive pass. For example, gcc uses one.

Upvotes: 3

EvilTeach
EvilTeach

Reputation: 28837

Original Question

After working with it for a while, I came up with a number of hash functions that seemed to work reasonably well on strings, resulting in a unique - perfect hashing.

Let's say the values ranged from L to H in the array. This yields a Range R = H - L + 1. Generally it was pretty big.

I then applied the modulus operator from H down to L + 1, looking for a mapping that keeps them unique, but has a smaller range.

In you case you are using integers. Technically, they are already hashed, but the range is large.

It may be that you can get what you want, simply by applying the modulus operator. It may be that you need to put a hash function in front of it first.

It also may be that you can't find a perfect hash for it, in which case your container class should have a fall back position.... binary search, or map or something like that, so that you can guarantee that the container will work in all cases.

Upvotes: 0

Diego Sevilla
Diego Sevilla

Reputation: 29001

What comes to my mind is gperf. However, it is based in strings and not in numbers. However, part of the calculation can be tweaked to use numbers as input for the hash generator.

Upvotes: 2

Tony Delroy
Tony Delroy

Reputation: 106068

It's not necessary or practical to aim for mapping N distinct randomly dispersed integers to N contiguous buckets - i.e. a perfect minimal hash - the important thing is to identify an acceptable ratio. To do this at run-time, you can start by configuring a worst-acceptible ratio (say 1 to 20) and a no-point-being-better-than-this-ratio (say 1 to 4), then randomly vary (e.g. changing prime numbers used) a fast-to-calculate hash algorithm to see how easily you can meet increasingly difficult ratios. For worst-acceptible you don't time out, or you fall back on something slower but reliable (container or displacement lists to resolve collisions). Then, allow a second or ten (configurable) for each X% better until you can't succeed at that ratio or reach the no-pint-being-better ratio....

Just so everyone's clear, this works for inputs only known at run time with no useful patterns known beforehand, which is why different hash functions have to be trialed or actively derived at run time. It is not acceptible to simple say "integer inputs form a hash", because there are collisions when %-ed into any sane array size. But, you don't need to aim for a perfectly packed array either. Remember too that you can have a sparse array of pointers to a packed array, so there's little memory wasted for large objects.

Upvotes: 0

Donnie
Donnie

Reputation: 46903

A perfect hash function maps a set of inputs onto the integers with no collisions. Given that your input is a set of integers, the values themselves are a perfect hash function. That really has nothing to do with the problem at hand.

The most obvious and easy to implement solution for testing existence would be a sorted list or balanced binary tree. Then you could decide existence in log(N) time. I doubt it'll get much better than that.

Upvotes: 1

zildjohn01
zildjohn01

Reputation: 11515

For this problem I would use a binary search, assuming it's possible to keep the list of numbers sorted.

Wikipedia has example implementations that should be simple enough to translate to C++.

Upvotes: 0

Related Questions