Heychsea

Reputation: 133

How to peek arbitrary value from HashSet quickly?

Currently, when I want to take any value from my Hashset, I do it this way:

my_set.iter().next().unwrap();

Compared to the first or last methods on BTreeSet, this takes a very long time, and my program suffers greatly from it. Switching to a BTreeSet is not an option either, since it slows the rest of my program down considerably.

Is there any way to get any value from my set faster than what I'm using?

Upvotes: 0

Views: 393

Answers (1)

SOFe

Reputation: 8214

The best you can do is to keep the hashtable at a high load factor, at the cost of a higher risk of hash collisions. Alternatively, if you know which entries are more likely to still be present, maintain a small index of those entries. Otherwise, this cannot be improved.
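For the load-factor point, here is a rough Rust sketch (the threshold and the function name peek_after_compaction are made up for illustration). The standard library's HashSet does not shrink its table when entries are removed, so after many removals iter().next() can scan long runs of empty buckets; shrink_to_fit() rebuilds the table densely, but is itself O(capacity), so we only do it when the table has become very sparse.

use std::collections::HashSet;

// Compact the table heuristically before peeking, so that the scan inside
// iter().next() only has to skip a few empty buckets.
fn peek_after_compaction<T: std::hash::Hash + Eq>(set: &mut HashSet<T>) -> Option<&T> {
    // Arbitrary illustrative threshold: compact when the table is mostly empty.
    if set.capacity() > 4 * set.len() {
        set.shrink_to_fit();
    }
    set.iter().next()
}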

The following describes an intuitive proof why this is not possible.

First, let's review the structure of a HashSet. A HashSet is backed by a hashtable keyed by hash values. The following uses the hashtable example from Wikipedia, which maps names to phone numbers: "John Smith" → 521-1234, "Lisa Smith" → 521-8976 and "Sandra Dee" → 521-9655.

Assume there exists an efficient algorithm to obtain an arbitrary entry from a hashtable.

Consider the case where we have inserted the three entries from the example, then call remove("John Smith") and remove("Lisa Smith"). Now we run this imaginary algorithm and obtain 521-9655. How is this done? Since hash values are assumed to be uniformly distributed, probing slots 00, 01, ... in order performs as well as any other strategy when no extra information is available. In the worst case we need to probe O(n) slots (in this example, 15 probes) to find the remaining entry. Note that this n is the number of hashtable slots, which is linearly related to the size of the HashSet through the load factor (or to the all-time maximum size, depending on whether the implementation shrinks and rebuilds the table when many items are removed).
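To make the probing argument concrete, here is a toy model in Rust. This is not the real layout the standard library uses; the table is just an array of optional slots, and the peek scans from slot 00 upwards until it hits an occupied one.

// Scan the slots in order and return the first occupied one.
fn peek_by_probing<T>(slots: &[Option<T>]) -> Option<&T> {
    for slot in slots {
        if let Some(value) = slot {
            return Some(value); // first occupied slot found
        }
    }
    None // every slot was empty
}

fn main() {
    // 16 slots, with only "Sandra Dee"'s number left near the end, as in the
    // example above: the peek needs ~15 probes for a set of size 1.
    let mut slots: Vec<Option<&str>> = vec![None; 16];
    slots[14] = Some("521-9655");
    assert_eq!(peek_by_probing(&slots), Some(&"521-9655"));
}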

So to obtain a faster algorithm, we must maintain extra information about the hashtable beyond what the original implementation stores. Consider keeping an index of f(n) pointers to slots that are likely to contain entries. How is this index maintained? Presumably by doing some extra work in insert() and/or remove(). Updating the index on insertion is trivial, but if the f(n) (< n) indexed entries are all removed, the index becomes empty, and it cannot be refilled without shifting the cost of the peek operation onto remove(). Searching from these pointers lets the imaginary algorithm reach O(n / f(n)).

But what should f(n) be? If f(n) = O(n), we are maintaining a second collection alongside the HashSet, which largely defeats the point of using a HashSet (why not just use a BTreeSet in that case?), since we have shifted the cost of finding an arbitrary entry onto insert()/remove(). If f(n) = O(1), then O(n / f(n)) = O(n) and the algorithm is no improvement. The same argument applies to other choices of f(n).
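As a sketch of the f(n) = O(n) extreme (all type and method names here are illustrative, not a real crate API): keeping every element in a dense Vec plus a map from element to its position makes peeking O(1), at the price of extra work and memory on every insert() and remove(), which is exactly the cost shift described above.

use std::collections::HashMap;

struct IndexedSet<T: std::hash::Hash + Eq + Clone> {
    index: HashMap<T, usize>, // element -> position in `dense`
    dense: Vec<T>,            // every element, in insertion order
}

impl<T: std::hash::Hash + Eq + Clone> IndexedSet<T> {
    fn new() -> Self {
        IndexedSet { index: HashMap::new(), dense: Vec::new() }
    }

    fn insert(&mut self, value: T) -> bool {
        if self.index.contains_key(&value) {
            return false;
        }
        self.index.insert(value.clone(), self.dense.len());
        self.dense.push(value);
        true
    }

    fn remove(&mut self, value: &T) -> bool {
        match self.index.remove(value) {
            None => false,
            Some(pos) => {
                // swap_remove keeps removal O(1), but the element that was
                // moved into `pos` must be re-indexed.
                self.dense.swap_remove(pos);
                if let Some(moved) = self.dense.get(pos) {
                    self.index.insert(moved.clone(), pos);
                }
                true
            }
        }
    }

    fn peek(&self) -> Option<&T> {
        self.dense.last() // O(1), no probing needed
    }
}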

To conclude: under the assumptions that we don't know which values are more likely to be inserted or removed and that hash keys are uniformly distributed, peeking an arbitrary value must either cost O(n) or shift a cost of comparable magnitude onto insert()/remove().

(This conclusion might still be useful. A simple suggestion is to compute the result lazily, assuming remove() calls are significantly less frequent than lookups of arbitrary values.)
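One way to read that suggestion, as a rough sketch (CachedPeekSet is a made-up name): cache one element and only redo the O(n) scan when a remove() happens to invalidate it. If removals are rare relative to peeks, most peeks are O(1).

use std::collections::HashSet;

struct CachedPeekSet<T: std::hash::Hash + Eq + Clone> {
    set: HashSet<T>,
    cached: Option<T>, // an element known to still be in `set`, if any
}

impl<T: std::hash::Hash + Eq + Clone> CachedPeekSet<T> {
    fn new() -> Self {
        CachedPeekSet { set: HashSet::new(), cached: None }
    }

    fn insert(&mut self, value: T) -> bool {
        if self.cached.is_none() {
            self.cached = Some(value.clone());
        }
        self.set.insert(value)
    }

    fn remove(&mut self, value: &T) -> bool {
        let removed = self.set.remove(value);
        // Invalidate lazily: the cache is only rebuilt on the next peek().
        if removed && self.cached.as_ref() == Some(value) {
            self.cached = None;
        }
        removed
    }

    fn peek(&mut self) -> Option<&T> {
        if self.cached.is_none() {
            // Rare path (only after a remove() hit the cached element):
            // one O(n) scan to pick a new element.
            self.cached = self.set.iter().next().cloned();
        }
        self.cached.as_ref()
    }
}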

Upvotes: 1
