Reputation: 474

Hashing With AVL Trees

I was working on a search program in C. I used hashing data structure. I just stored a word once and from that word I pointed to the strings in which that word exists. So whenever a user give a word, all the strings containing that word will be given.

But instead of using a linked list, I used an AVL Tree in hashing. In short the next node in the key is pointing to root of an AVL tree. This can reduce the time complexity from O(n) to O(log n).

Suppose one node has thousands of strings connected together. Is it good to use hashing with AVL tree. (I also tried Tries Data structure for this.)

Is there is any better algorithm for this?

Upvotes: 2

Answers (3)

ikegami

Reputation: 386396

There's a problem with the hashing algorithm or you have too few buckets. A bucket shouldn't contain more than a handful of keys.

Don't try to support large buckets; fix the actual problem causing the buckets to be large in the first place.

For reference, I loaded the 99,171 words in my system's /usr/share/dict/words into a Perl hash table^[1]. The resulting hash table had 131,072 buckets, no bucket had more than 7 keys, and it requires at most three comparisons to locate an element (or determine that it is missing) for 99% of inputs.

Keys:     99,171
Buckets: 131,072
Distribution:
   Buckets with 0 keys: 61,461
   Buckets with 1 keys: 46,668
   Buckets with 2 keys: 17,547
   Buckets with 3 keys:  4,343
   Buckets with 4 keys:    903
   Buckets with 5 keys:    133
   Buckets with 6 keys:     16
   Buckets with 7 keys:      1

Search for absent keys:
   0 comparisons: 47% of the time    ≤0 comparisons:  47% of the time
   1 comparison:  36% of the time    ≤1 comparison:   83% of the time
   2 comparisons: 13% of the time    ≤2 comparisons:  96% of the time
   3 comparisons:  3% of the time    ≤3 comparisons:  99% of the time
   4 comparisons:  1% of the time    ≤4 comparisons: 100% of the time
   5 comparisons:  0% of the time    ≤5 comparisons: 100% of the time
   6 comparisons:  0% of the time    ≤6 comparisons: 100% of the time
   7 comparisons:  0% of the time    ≤7 comparisons: 100% of the time

Search for present keys:
   1 comparison:  70% of keys        ≤1 comparison:   70% of the keys
   2 comparisons: 23% of keys        ≤2 comparisons:  93% of the keys
   3 comparisons:  5% of keys        ≤3 comparisons:  99% of the keys
   4 comparisons:  1% of keys        ≤4 comparisons: 100% of the keys
   5 comparisons:  0% of keys        ≤5 comparisons: 100% of the keys
   6 comparisons:  0% of keys        ≤6 comparisons: 100% of the keys
   7 comparisons:  0% of keys        ≤7 comparisons: 100% of the keys

By replacing the linked list with an AVL tree, you are just waiting space.

Now, hashes do provide O(1) lookups and amortized O(1) insertions, but they aren't particularly compact. If you're ok with slightly slower lookups (O(log n)) lookups, you could save a lot of space by using some kind of tree instead of the hash.

There are many alternatives. One is a ternary search tree, which is supposedly particularly efficient at storing lots of value with common prefixes.

I used perl -MDevel::Peek -nle'++$h{$_} END{ Dump(%h,0) }' /usr/share/dict/words

Upvotes: 2

templatetypedef

Reputation: 373002

Closed addressing hashing - which is what you’re using - works by picking some hash function, using it to distribute elements across buckets, and then having some data structure per bucket to handle collisions. What you’re proposing (using an AVL tree per bucket) does make the worst-case performance of a lookup asymptotically better, but for it to be worthwhile you need to have a lot of collisions. There’s a nifty result that says that if you have a sufficiently good hash function and drop n objects into an n-slot hash table, the largest bucket has expected size O(log n / log log n), which is a very small number.

If you’re seeing a performance win from switching to a tree to resolve collisions, it probably means that you have too many elements ending up in each bucket. That could mean, for example, that you don’t have enough buckets for the number of elements you have. In that case, increasing the number of buckets would likely be a better solution than resolving collisions more efficiently.

It could also mean that you don’t have a very good hash function, which would cause your strings to cluster in a small number of buckets. In that case, choosing a better hash function would likely give you much better performance wins than using AVL trees.

Upvotes: 3

user149341

Reputation:

Suppose one node has thousands of strings connected together. Is it good to use hashing with AVL tree.

No. At that point, you're hardly even using a hash table anymore -- if you're happy with O(log n) performance, use the tree directly, without the hash table at the top.

If you have "thousands of strings" under a single hash bucket, your hash table is way too small -- ideally, most buckets should have at most one object in them. Use a larger hash table to get the O(1) insert/lookup performance that hash tables can provide.

Upvotes: 3

Hashing With AVL Trees

Answers (3)

Related Questions