Jim

Reputation: 19552

Why are we using linked lists to address collisions in hash tables?

I was wondering why many languages (Java, C++, Python, Perl, etc.) implement hash tables using linked lists to handle collisions instead of arrays.
I mean instead of buckets of linked lists, we should use arrays.
If the concern is about the size of the array, then that means we have too many collisions, so we already have a problem with the hash function and not with the way we address collisions. Am I misunderstanding something?

Upvotes: 5

Views: 3697

Answers (5)

Tony Delroy

Reputation: 106068

I mean instead of buckets of linked lists, we should use arrays.

Pros and cons to everything, depending on many factors.

The two biggest problems with arrays:

  1. changing capacity involves copying all content to another memory area

  2. you have to choose between:

    a) arrays of Element*s, adding one extra indirection during table operations, and one extra memory allocation per non-empty bucket with associated heap management overheads

    b) arrays of Elements, such that the pre-existing Elements iterators/pointers/references are invalidated by some operations on other nodes (e.g. insert) (the linked list approach - or 2a above for that matter - needn't invalidate these)

...will ignore several smaller design choices about indirection with arrays...

Practical ways to reduce copying from 1. include keeping excess capacity (i.e. currently unused memory for anticipated or already-erased elements), and - if sizeof(Element) is much greater than sizeof(Element*) - you're pushed towards arrays-of-Element*s (with "2a" problems) rather than Element[]s/2b.
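
For concreteness, a minimal sketch of the two array layouts from 2a/2b above (Element and the struct names are placeholders for illustration, not any library's actual internals):

    #include <vector>

    struct Element { int key; int value; };

    // 2a: array of pointers - one extra indirection and one heap allocation
    // per entry, but the Elements themselves never move when the bucket grows.
    struct BucketOfPointers {
        std::vector<Element*> slots;
    };

    // 2b: array of Elements - contiguous and cache-friendly, but growing the
    // vector relocates the Elements, invalidating pointers/references to them.
    struct BucketOfElements {
        std::vector<Element> slots;
    };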


There are a couple of other answers claiming erasing in arrays is more expensive than for linked lists, but the opposite's often true: searching contiguous Elements is faster than scanning a linked list (fewer steps in code, more cache friendly), and once found you can copy the last array Element or Element* over the one being erased, then decrement the size.
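
A hedged sketch of that erase-by-overwrite idea for an array bucket (names are illustrative):

    #include <cstddef>
    #include <vector>

    struct Element { int key; int value; };

    // Erase from an array bucket without shifting: overwrite the victim with
    // the last element, then shrink. Order within a bucket doesn't matter,
    // so this is legal, and O(1) once the element has been found.
    bool eraseFromBucket(std::vector<Element>& bucket, int key) {
        for (std::size_t i = 0; i < bucket.size(); ++i) {
            if (bucket[i].key == key) {
                bucket[i] = bucket.back();   // copy last Element over the victim
                bucket.pop_back();           // drop the now-duplicated tail
                return true;
            }
        }
        return false;
    }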


If the concern is about the size of the array, then that means we have too many collisions, so we already have a problem with the hash function and not with the way we address collisions. Am I misunderstanding something?

To answer that, let's look at what happens with a great hash function. Packing a million elements into a million buckets using a cryptographic strength hash, a few runs of my program counting the number of buckets to which 0, 1, 2 etc. elements hashed yielded...

0=367790 1=367843 2=184192 3=61200 4=15370 5=3035 6=486 7=71 8=11 9=2
0=367664 1=367788 2=184377 3=61424 4=15231 5=2933 6=497 7=75 8=10 10=1
0=367717 1=368151 2=183837 3=61328 4=15300 5=3104 6=486 7=64 8=10 9=3

If we increase that to 100 million elements - still with load factor 1.0:

0=36787653 1=36788486 2=18394273 3=6130573 4=1532728 5=306937 6=51005 7=7264 8=968 9=101 10=11 11=1

We can see the ratios are pretty stable. Even with load factor 1.0 (the default maximum for C++'s unordered_set and -map), 36.8% of buckets can be expected to be empty, another 36.8% handling one Element, 18.4% 2 Elements and so on. For any given array resizing logic you can easily get a sense of how often it will need to resize (and potentially copy elements). You're right that it doesn't look bad, and may be better than linked lists if you're doing lots of lookups or iterations, for this idealistic cryptographic-hash case.
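
Those stable ratios are the Poisson distribution at work: with load factor lambda, a well-mixing hash leaves a fraction e^-lambda * lambda^k / k! of buckets holding k elements, and e^-1 is about 0.368, hence the 36.8% figures above. A minimal sketch of the kind of counting program described above, using a 64-bit PRNG as a stand-in for a cryptographic-strength hash:

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <map>
    #include <random>
    #include <vector>

    int main() {
        const std::size_t n = 1000000;                // elements == buckets: load factor 1.0
        std::vector<std::uint32_t> counts(n, 0);
        std::mt19937_64 rng(std::random_device{}()); // stand-in for a strong hash
        for (std::size_t i = 0; i < n; ++i)
            ++counts[rng() % n];                      // each element lands in a "random" bucket
        std::map<std::uint32_t, std::size_t> histogram;  // occupancy -> number of buckets
        for (std::uint32_t c : counts) ++histogram[c];
        for (const auto& [occupancy, buckets] : histogram)
            std::printf("%u=%zu ", occupancy, buckets);
        std::printf("\n");
    }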

But good-quality hashing is relatively expensive in CPU time, so the hash functions shipped with general-purpose hash tables are often very weak: e.g. it's very common for C++ Standard library implementations of std::hash<int> to return their argument, and MS Visual C++'s std::hash<std::string> picks 10 characters evenly spaced along the string to incorporate in the hash value, regardless of how long the string is.
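
For instance, a trivial identity-style integer hash looks like this (illustrative only; actual library code varies by implementation):

    #include <cstddef>

    struct TrivialIntHash {
        // Fast but weak: consecutive keys map to consecutive buckets, and
        // any pattern in the keys passes straight through to the table.
        std::size_t operator()(int x) const { return static_cast<std::size_t>(x); }
    };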

Clearly, implementers' experience has been that this combination of weak-but-fast hash functions and linked lists (or trees) to handle the greater collision proneness works out faster on average - and has fewer user-antagonising manifestations of obnoxiously bad performance - for everyday keys and requirements.

Upvotes: 2

Uday Kiran Katuri

Reputation: 71

If buckets are implemented using arrays, insertion will be costly due to reallocation, which doesn't happen with a linked list.

Coming to deletion, we have to search the complete array and then either mark the slot as deleted or shift the remaining elements forward. (The former makes insertion even more complicated, since we then have to search for empty slots.)

To improve the worst-case time complexity from O(n) to O(log n), once the number of items in a hash bucket grows beyond a certain threshold (8 entries in Java 8's HashMap), that bucket switches from using a linked list of entries to a balanced tree.
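
A hedged C++ sketch of that switch-at-a-threshold idea (this is not Java's actual HashMap code; std::set stands in for the balanced tree, and the threshold of 8 mirrors Java 8's TREEIFY_THRESHOLD):

    #include <algorithm>
    #include <cstddef>
    #include <forward_list>
    #include <set>
    #include <variant>

    constexpr std::size_t kTreeifyThreshold = 8;   // mirrors Java 8's HashMap

    struct Bucket {
        std::variant<std::forward_list<int>, std::set<int>> rep;  // starts as a list
        std::size_t count = 0;

        void insert(int key) {
            if (auto* list = std::get_if<std::forward_list<int>>(&rep)) {
                list->push_front(key);
                if (++count > kTreeifyThreshold) {
                    // Too many collisions: switch to a balanced tree so
                    // lookups become O(log n) instead of O(n).
                    std::set<int> tree(list->begin(), list->end());
                    rep = std::move(tree);
                }
            } else {
                std::get<std::set<int>>(rep).insert(key);
                ++count;
            }
        }

        bool contains(int key) const {
            if (auto* list = std::get_if<std::forward_list<int>>(&rep))
                return std::find(list->begin(), list->end(), key) != list->end();
            return std::get<std::set<int>>(rep).count(key) != 0;
        }
    };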

Upvotes: 1

leventov

Reputation: 15263

why many languages (Java, C++, Python, Perl, etc.) implement hash tables using linked lists to handle collisions instead of arrays?

I'm almost sure, at least for most of those "many" languages:

The original implementors of hash tables for these languages just followed the classic algorithm description from Knuth or another algorithms book, and didn't even consider such subtle implementation choices.

Some observations:

  • Even choosing collision resolution with separate chaining instead of, say, open addressing for the "most generic hash table implementation" is a seriously doubtful choice. My personal conviction is that it is not the right choice.

  • When the hash table's load factor is pretty low (as it should be in nearly 99% of hash table usages), the difference between the suggested approaches can hardly affect overall data structure performance (as cmaster explained at the beginning of his answer, and delnan meaningfully refined in the comments). Since generic hash table implementations in languages are not designed for high density, "linked lists vs arrays" is not a pressing issue for them.

  • Returning to the topic question itself, I don't see any conceptual reason why linked lists should be better than arrays. I can easily imagine that, in fact, arrays are faster on modern hardware and consume less memory with modern memory allocators inside modern language runtimes/operating systems, especially when the hash table's key is primitive or a copied structure. You can find some arguments backing this opinion here: http://en.wikipedia.org/wiki/Hash_table#Separate_chaining_with_other_structures

    But the only way to find the correct answer (for a particular CPU, OS, memory allocator, virtual machine and its garbage collection algorithm, and the hash table use case/workload!) is to implement both approaches and compare them.

Am I misunderstanding something?

No, you don't misunderstand anything; your question is legitimate. It's an example of fair confusion, when something is done in some specific way not for a strong reason but largely by happenstance.

Upvotes: 1

cmaster

The reason is that the expected length of these lists is tiny, with only zero, one, or two entries in the vast majority of cases. Yet these lists may also become arbitrarily long in the worst case of a really bad hash function. And even though this worst case is not the case that hash tables are optimized for, they still need to be able to handle it gracefully.

Now, for an array-based approach, you would need to set a minimal array size. And if that initial array size is anything other than zero, you already have significant space overhead due to all the empty buckets. A minimal array size of two would mean that you waste half your space. And you would need to implement logic to reallocate the arrays when they become full, because you cannot put an upper limit on the list length: you need to be able to handle the worst case.

The list-based approach is much more efficient under these constraints: it has only the allocation overhead for the node objects, most accesses have the same amount of indirection as the array-based approach, and it's easier to write.

I'm not saying that it's impossible to write an array-based implementation, but it's significantly more complex and less efficient than the list-based approach.
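
A minimal sketch of the list-based bucket being described, with one node allocation per entry and zero overhead for empty buckets (names are illustrative; cleanup/destructor omitted for brevity):

    #include <cstddef>
    #include <functional>
    #include <vector>

    struct Node {
        int key;
        int value;
        Node* next;   // singly linked chain of colliding entries
    };

    struct ChainedTable {
        std::vector<Node*> buckets;   // nullptr == empty bucket, no wasted space

        explicit ChainedTable(std::size_t n) : buckets(n, nullptr) {}

        void insert(int key, int value) {
            std::size_t i = std::hash<int>{}(key) % buckets.size();
            buckets[i] = new Node{key, value, buckets[i]};   // one heap op per entry
        }

        Node* find(int key) const {
            std::size_t i = std::hash<int>{}(key) % buckets.size();
            for (Node* n = buckets[i]; n; n = n->next)
                if (n->key == key) return n;
            return nullptr;
        }
    };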

Upvotes: 1

BitTickler

Reputation: 11875

Strategy 1

Use (small) arrays which get instantiated and subsequently filled once collisions occur: one heap operation for the allocation of the array, then room for N-1 more entries. If no collision ever occurs again for that bucket, N-1 slots of capacity are wasted. The list wins if collisions are rare, since no excess memory is allocated just for the probability of having more overflows on a bucket. Removing items is also more expensive: either mark deleted spots in the array or move the elements behind them to the front. And what if the array is full? A linked list of arrays, or resize the array?

One potential benefit of using arrays would be to do a sorted insert and then a binary search upon retrieval, which the linked list approach cannot compete with. But whether or not that pays off depends on the write/retrieve ratio: the less frequently writes occur, the more this could pay off.
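
A hedged sketch of that sorted-array bucket (illustrative names, assuming integer keys):

    #include <algorithm>
    #include <vector>

    struct SortedBucket {
        std::vector<int> keys;   // kept sorted at all times

        void insert(int key) {   // O(n) shifting on insert...
            auto it = std::lower_bound(keys.begin(), keys.end(), key);
            if (it == keys.end() || *it != key)
                keys.insert(it, key);
        }

        bool contains(int key) const {   // ...buys O(log n) retrieval
            return std::binary_search(keys.begin(), keys.end(), key);
        }
    };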

Strategy 2

Use lists. You pay for what you get: 1 collision = 1 heap operation, with no eager assumption (and no memory price paid) that "more will come". Linear search within the collision lists. Cheaper delete (not counting the free() here). One major motivation for considering arrays instead of lists is to reduce the number of heap operations. Amusingly, the general assumption seems to be that heap operations are cheap, but few people actually know how much time an allocation requires compared to, say, traversing a list looking for a match.

Strategy 3

Use neither arrays nor lists, but store the overflow entries at another location within the hash table itself. Last time I mentioned that here, I got frowned upon a bit. Benefit: zero memory allocations. This probably works best if the table's fill grade is low and only few collisions occur.
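
This strategy is essentially open addressing. A minimal sketch with linear probing (illustrative; assumes non-negative keys so -1 can serve as the "empty" sentinel, and omits deletion, which needs tombstones):

    #include <cstddef>
    #include <functional>
    #include <vector>

    // Overflow entries live in the table itself, in the next free slot.
    // Zero per-entry allocations; the table is assumed never to fill up.
    struct ProbingTable {
        std::vector<int> slots;   // -1 marks an empty slot

        explicit ProbingTable(std::size_t n) : slots(n, -1) {}

        void insert(int key) {
            std::size_t i = std::hash<int>{}(key) % slots.size();
            while (slots[i] != -1 && slots[i] != key)
                i = (i + 1) % slots.size();   // probe forward to the next free slot
            slots[i] = key;
        }

        bool contains(int key) const {
            std::size_t i = std::hash<int>{}(key) % slots.size();
            while (slots[i] != -1) {
                if (slots[i] == key) return true;
                i = (i + 1) % slots.size();
            }
            return false;
        }
    };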

Summary

There are indeed many options and trade-offs to choose from. Generic hash table implementations, such as those in standard libraries, cannot make any assumptions regarding the write/read ratio, the quality of the hash key, the use cases, etc. If, on the other hand, all those traits of a hash table application are known (and if it is worth the effort), it is entirely possible to create an optimized hash table implementation tailored to the set of trade-offs the application requires.

Upvotes: 1
