philipp

Reputation: 1813

QHash: Any weak spots performance-wise, besides rehashing? Why not?

This is more of a theoretical question. I have to admit that I did some rather sophisticated performance tests on this a while ago, but I cannot find the source code anymore.

Some words about the type of application: I am focusing on scenarios with really large numbers of entries, from 50,000 up to several million, while memory consumption does not really matter.

I fully understand the basic concept of a hash data structure, why it generally has constant access times, and why rehashing is required at some point.

Any possible key is mapped to a certain slot in the hash structure. Of course, collisions may occur, resulting in multiple keys being mapped to the same slot. This is where implementation details come into play. As far as I know, there is some kind of logic using the "next" slot if the initially assigned slot is occupied.
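To make it clearer what I mean by a key being mapped to a slot, here is a tiny sketch (the table size is arbitrary, and the real QHash internals are certainly more involved than a plain modulo):

```cpp
#include <QHash>
#include <QStringList>
#include <QDebug>

int main() {
    const uint slotCount = 8;                          // arbitrary table size for illustration
    const QStringList keys = {"alpha", "beta", "gamma", "delta"};

    for (const QString &key : keys) {
        // qHash() yields the raw hash value; reducing it modulo the table
        // size gives the slot the key is initially mapped to. Two keys with
        // the same reduced value collide in that slot.
        const auto slot = qHash(key) % slotCount;
        qDebug() << key << "-> slot" << slot;
    }
    return 0;
}
```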

My feeling is that there has to be a weak spot somewhere, performance-wise. Take a really big QHash, filled just below its capacity, from which keys are then removed at random while new keys are added (without ever increasing the total number of stored keys, so that it never rehashes): I would think this has to lead to severe performance degradation in the long term.

Filling a QHash to just below its capacity with random keys should result in a considerable number of collisions. Looking up a key affected by collisions requires multiple slots to be inspected, which costs performance. Removing keys and adding new random keys in their place should make things even worse: contiguous sequences of colliding keys become fragmented, and a colliding key occupies a 'foreign' slot, forcing the key actually mapped to that slot to be stored somewhere else, even if the foreign slot is freed later...

In short, I would expect that in the given scenario (performing deletes and inserts on a QHash which is always kept just below its capacity) performance degrades in the long run, either because lookup times increase due to the growing disorder, or because of periodic reordering.
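Concretely, the kind of churn I have in mind looks roughly like this (a sketch, not my original test code; sizes and iteration counts are arbitrary):

```cpp
#include <QHash>
#include <QVector>
#include <QElapsedTimer>
#include <QRandomGenerator>
#include <QDebug>

int main() {
    const int targetSize = 1'000'000;          // "really big", but arbitrary
    QHash<quint32, int> hash;
    QVector<quint32> keys;
    keys.reserve(targetSize);

    // Fill the hash with random keys up to the target size.
    while (hash.size() < targetSize) {
        const quint32 k = QRandomGenerator::global()->generate();
        if (!hash.contains(k)) {
            hash.insert(k, 0);
            keys.append(k);
        }
    }

    // Churn: repeatedly remove one random existing key and insert a new
    // random one, keeping the total size constant so no rehash is needed.
    QElapsedTimer timer;
    timer.start();
    for (int i = 0; i < 10'000'000; ++i) {
        const quint32 k = QRandomGenerator::global()->generate();
        if (hash.contains(k))
            continue;                          // rare duplicate key, skip it
        const int idx = QRandomGenerator::global()->bounded(int(keys.size()));
        hash.remove(keys[idx]);
        hash.insert(k, 0);
        keys[idx] = k;
    }
    qDebug() << "churn took" << timer.elapsed() << "ms";
    return 0;
}
```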

However, when I made some effort to demonstrate this performance degradation once (as I said, I cannot find that project anymore, so I stand here bare-handed, I'm afraid), I could not find any.

Is there some special magic in how QHash handles collisions that I am not aware of?

Upvotes: 0

Views: 749

Answers (1)

bolov

Reputation: 75795

tl;dr;

Is there some special magic in how QHash handles collisions that I am not aware of?

There is no magic. You just misunderstood how hash maps work.

(*) I will discuss the general concept of a hash map, not the specific implementation QHash. Although there are several approaches to handling collisions, what I describe here is the most common pattern. It is used, among others, by C++ std::unordered_map, Qt QHash, and Java HashMap.


Your terminology with "slots" is a bit confusing. I first thought that by slot you meant "bucket", but I now think you mean an element of a bucket.

So in a hash map, colliding keys are stored in the same bucket. The bucket can be any container, from a vector to a list.

Finding a bucket is O(1) and finding a key inside a bucket is O(k), where k is the bucket's length. So key access is constant in best case and linear in worst case.
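You can inspect this structure directly with std::unordered_map, which exposes its buckets in its public API (a small sketch):

```cpp
#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<int, std::string> map;
    for (int i = 0; i < 20; ++i)
        map[i] = "value " + std::to_string(i);

    std::cout << "buckets: " << map.bucket_count()
              << ", load factor: " << map.load_factor() << '\n';

    // Looking up a key means: jump to its bucket in O(1), then scan the
    // (usually very short) chain of entries stored in that bucket.
    const int key = 7;
    const std::size_t b = map.bucket(key);
    std::cout << "key " << key << " lives in bucket " << b
              << ", which holds " << map.bucket_size(b) << " entry/entries\n";
    return 0;
}
```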

You seem to assume that the number of buckets increases only when the hash map fills its capacity. Well, there is no such thing as a capacity for a hash map (as there is for a vector, for instance). So the situation you describe ("having a hash with a capacity of, let's say, 100, where in the worst case all 100 elements collide and are stored in the same bucket") can't happen.

For a hash map you have a measure called the "load factor", which is the average number of elements per bucket (size / bucket_count). The hash will increase the number of buckets (and recompute the hash of every element, repositioning them) when the load factor exceeds a threshold, the max load factor. Performance is assured first and foremost by the quality of the hash function, which must ensure that keys are spread uniformly across all buckets. But no matter how good the hash function is, you can still have situations where some buckets are considerably larger than the rest. The fail-safe for that is the aforementioned max load factor (see the sketch after the list below).

As a consequence, the max load factor serves two purposes:

  • it acts as a "logical capacity" if you will: it makes the hash naturally increase the buckets count in the scenario that elements are added to the hash uniformly, making all the buckets too large.

  • it acts as a "fail safe" for the hash function. It increases the buckets count in the (rare) scenario that you have multiple keys colliding on a small subset of buckets.

Upvotes: 1
