selbie

Reputation: 104474

unordered_map<TYPE, bool> vs. set<TYPE>

What are the actual tradeoffs of using a hash-table collection type such as std::unordered_map vs. std::set?

For something casual I am working on (in C++), I have a set intersection problem of identifying duplicate items from a pair of large lists.

My first thought was to iterate through the first list and insert each item into a std::unordered_map<T, bool> (or std::hash_map), where the value parameter on insertion is always true, then do lookups in the hash map for each item in the second list. The working assumption is that each insertion is O(1) and each lookup is also O(1).
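Here is a minimal sketch of what I mean (list1 and list2 are hypothetical stand-ins for the two large lists):

    #include <cstdio>
    #include <unordered_map>
    #include <vector>

    int main() {
        // Hypothetical stand-ins for the two large lists.
        std::vector<unsigned int> list1 = {1, 2, 3, 5, 8};
        std::vector<unsigned int> list2 = {2, 3, 4, 8, 9};

        // Pass 1: insert every element of the first list, value always true.
        std::unordered_map<unsigned int, bool> seen;
        for (unsigned int x : list1)
            seen[x] = true;                    // average O(1) per insertion

        // Pass 2: look up each element of the second list.
        for (unsigned int x : list2)
            if (seen.count(x) != 0)            // average O(1) per lookup
                std::printf("duplicate: %u\n", x);
    }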

Then I started to think that perhaps std::set is more appropriate. Some cursory searches online suggest that std::set is implemented as a red/black tree, and that insertion and lookup run in O(log n) time instead of O(1). (Is that correct?)

I'm assuming the tradeoffs between the two are memory usage and the use of a hash function (vs. straight-up comparison). The actual type of the data I am using is just an unsigned int. I can imagine the dynamics of this problem changing for a more complex type with a different hash function.

Upvotes: 3

Views: 714

Answers (2)

selbie

Reputation: 104474

set<> and map<> are typically implemented with a tree data structure, thus incurring O(log n) running time for insertion and lookup.

unordered_set<> and unordered_map<> are typically implemented with a hash table, thus obtaining average-case O(1) performance for insertion and lookup.
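As a quick illustration of the structural difference (a sketch of standard behavior, not of any particular implementation): a std::set iterates in sorted order because of its tree layout, while a std::unordered_set makes no ordering promise:

    #include <iostream>
    #include <set>
    #include <unordered_set>

    int main() {
        std::set<int> tree = {42, 7, 19};            // tree-based: kept sorted
        std::unordered_set<int> hash = {42, 7, 19};  // hash-based: bucket order

        for (int x : tree) std::cout << x << ' ';    // always prints 7 19 42
        std::cout << '\n';
        for (int x : hash) std::cout << x << ' ';    // some unspecified order
        std::cout << '\n';
    }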

To be determined - I'm not certain why set<> and map<> couldn't be implemented as a combination of a hash table and doubly linked list, where each element in the hash table encapsulates both the value and pointers to the previous/next nodes that were inserted. That will be a question for another day.
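For illustration, here is a rough sketch of what such a combined structure might look like (purely hypothetical; the LinkedHashSet name and layout are my own invention, not how any standard library implements set<> or map<>):

    #include <unordered_map>

    // Purely illustrative: a hash table whose entries also form a doubly
    // linked list recording insertion order. Average O(1) insert/lookup,
    // plus the ability to walk elements in the order they arrived.
    template <typename T>
    class LinkedHashSet {
        struct Node {
            Node* prev = nullptr;
            Node* next = nullptr;
        };
        std::unordered_map<T, Node> table_;  // value -> its list node
        Node* head_ = nullptr;               // first inserted
        Node* tail_ = nullptr;               // most recently inserted

    public:
        bool insert(const T& value) {
            auto [it, fresh] = table_.try_emplace(value);
            if (!fresh) return false;        // already present
            Node* node = &it->second;        // stable even across rehashing
            node->prev = tail_;
            if (tail_) tail_->next = node; else head_ = node;
            tail_ = node;
            return true;
        }
        bool contains(const T& value) const { return table_.count(value) != 0; }
    };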

Upvotes: 0

Dimitrios Bouzas

Reputation: 42889

Assume that you have two lists, L1 and L2, with N and M elements respectively, and that L1 and L2 contain unique elements (i.e., L#(i) != L#(j) for each i != j).


Your 1st algorithm:

step 1: Copying the elements of L1 into an unordered_map U has time complexity:

  • Average case O(N).

  • Worst case O(N^2).

step 2: Iterate through the elements of L2 and, for each element, check whether it exists in U.

  • Average case O(M) * O(1) = O(M).

  • Worst case O(M) * O(N) = O(M*N).

Overall:

  • Average case O(N) + O(M), linear complexity.

  • Worst case O(N^2) + O(M*N), quadratic complexity.


Your 2nd algorithm:

step 1: Copying the elements of L1 into a set S has time complexity:

  • Average case O(N) * O(log(N)).

  • Worst case O(N) * O(log(N)).

step 2: Iterate through the elements of L2 and, for each element, check whether it exists in S.

  • Average case O(M) * O(log(N)).

  • Worst case O(M) * O(log(N)).

Overall:

  • Average case O(M) * O(log(N)) + O(N) * O(log(N)) = O((N + M) * log(N)), linearithmic complexity.

  • Worst case O(M) * O(log(N)) + O(N) * O(log(N)) = O((N + M) * log(N)), linearithmic complexity.


Results:

Asymptotically, the 1st algorithm wins in the average case but loses to the 2nd algorithm in the worst case.


Comments:

  1. The proposed algorithm using unordered_set has the same asymptotic time complexity as the 1st algorithm. In practice it is better and faster because you don't carry the redundant boolean values (see the sketch after this list).
  2. In practice, performance involves more than theoretical complexity because of cache memory. Data structures that store their elements contiguously tend to perform better than ones with fragmented storage. Herb Sutter explains this effect nicely in this video lecture.
  3. In practice, all of the above can be hocus pocus: you always have to profile your code to determine which algorithm is actually faster. Eric Brumer explains this nicely in this video lecture.
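
As referenced in comment 1, here is a minimal sketch of the unordered_set variant (the has_dup helper is hypothetical; only the container changes relative to the 1st algorithm):

    #include <unordered_set>
    #include <vector>

    // Hypothetical helper: same duplicate scan as the 1st algorithm,
    // minus the redundant bool payload.
    bool has_dup(const std::vector<unsigned int>& l1,
                 const std::vector<unsigned int>& l2) {
        std::unordered_set<unsigned int> seen(l1.begin(), l1.end());
        for (unsigned int x : l2)
            if (seen.count(x) != 0)
                return true;
        return false;
    }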

Upvotes: 3
