Reputation: 104474
What are the actual tradeoffs of using a hash table collection type such as std::unordered_map vs. std::set?
For something casual I am working on (in C++), I have a set intersection problem of identifying duplicate items from a pair of large lists.
My first assumption was to iterate through the first list and insert each item into a std::unordered_map<T, bool> (or std::hash_map), where the value parameter on insertion is always true. Then do lookups in the hash map for each item in the second list. The working assumption is that each insertion is O(1) and each lookup is also O(1).
Then I started to think perhaps std::set is more appropriate. Some cursory searches online reveal that std::set is typically implemented as a red/black tree, and that insertions and lookups run in O(lg n) time instead of O(1). (Is that correct?)
I'm assuming the tradeoffs between the two are mainly memory usage and the cost of a hashing function (vs. a straight-up comparison). The actual type of the data I am using is just an unsigned int. I could imagine the dynamics of this problem changing with a more complex type and a different hashing function.
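In code, the idea I have in mind looks roughly like this (a minimal sketch; the list contents are just placeholders):

#include <iostream>
#include <unordered_map>
#include <vector>

int main() {
    // Placeholder inputs; in practice these lists are large.
    std::vector<unsigned int> list1 = {1, 5, 9, 42, 7};
    std::vector<unsigned int> list2 = {3, 42, 8, 5};

    // Insert each element of the first list with a value of true: average O(1) each.
    std::unordered_map<unsigned int, bool> seen;
    for (unsigned int x : list1)
        seen[x] = true;

    // Look up each element of the second list: average O(1) each.
    for (unsigned int x : list2)
        if (seen.find(x) != seen.end())
            std::cout << x << " appears in both lists\n";
}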
Upvotes: 3
Views: 714
Reputation: 104474
set<> and map<> are typically implemented with a tree data structure, thus incurring an O(lg n) running time for insertion and lookup.
unordered_set<> and unordered_map<> are typically implemented with a hash table structure, giving average-case O(1) insertion and lookup.
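A minimal sketch of the difference (the values are arbitrary):

#include <iostream>
#include <set>
#include <unordered_set>

int main() {
    std::set<unsigned int> tree_based = {3, 1, 2};            // red/black tree: O(lg n) insert/lookup
    std::unordered_set<unsigned int> hash_based = {3, 1, 2};  // hash table: average O(1) insert/lookup

    // Both expose the same lookup interface; only the complexity and iteration order differ.
    std::cout << tree_based.count(2) << ' ' << hash_based.count(2) << '\n';  // prints: 1 1

    // A set iterates in sorted key order; an unordered_set has no defined order.
    for (unsigned int x : tree_based)
        std::cout << x << ' ';                                // prints: 1 2 3
}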
To be determined: I'm not certain why set<> and map<> couldn't be implemented as a combination of a hash table and a doubly linked list, where each element in the hash table encapsulates both the value and pointers to the previous/next nodes that were inserted. That will be a question for another day.
Upvotes: 0
Reputation: 42889
Assuming that you have 2 lists (e.g., L1 and L2) with N and M elements respectively, and that L1 and L2 each contain only unique elements (i.e., L#(i) != L#(j) for each i != j).
1st algorithm (unordered_map):

step1: Copying the elements of L1 into an unordered_map U is of time complexity:
Average case: O(N).
Worst case: O(N^2).

step2: Iterate through the elements of L2 and for each element check if it exists in U:
Average case: O(M) * O(1) = O(M).
Worst case: O(M) * O(N) = O(M*N).

Overall:
Average case: O(N) + O(M), linear complexity.
Worst case: O(N^2) + O(M*N), quadratic complexity.
2nd algorithm (set):

step1: Copying the elements of L1 into a set S is of time complexity:
Average case: O(N) * O(log(N)).
Worst case: O(N) * O(log(N)).

step2: Iterate through the elements of L2 and for each element check if it exists in S:
Average case: O(M) * O(log(N)).
Worst case: O(M) * O(log(N)).

Overall:
Average case: O(M) * O(log(N)) + O(N) * O(log(N)), linear logarithmic complexity.
Worst case: O(M) * O(log(N)) + O(N) * O(log(N)), linear logarithmic complexity.
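For illustration, a minimal sketch of the 2nd algorithm (placeholder inputs; swapping std::set for std::unordered_set gives the hash-table variant of the 1st algorithm):

#include <iostream>
#include <set>
#include <vector>

// Returns the elements common to both lists, assuming each list has no
// internal duplicates (per the assumption above).
std::vector<unsigned int> common_elements(const std::vector<unsigned int>& l1,
                                          const std::vector<unsigned int>& l2) {
    // step1: copy L1 into a set S -- N insertions, O(log(N)) each.
    std::set<unsigned int> s(l1.begin(), l1.end());

    // step2: check each element of L2 against S -- M lookups, O(log(N)) each.
    std::vector<unsigned int> result;
    for (unsigned int x : l2)
        if (s.count(x) != 0)
            result.push_back(x);
    return result;
}

int main() {
    std::vector<unsigned int> l1 = {1, 5, 9, 42, 7};
    std::vector<unsigned int> l2 = {3, 42, 8, 5};
    for (unsigned int x : common_elements(l1, l2))
        std::cout << x << '\n';   // prints 42 and 5
}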
Asymptotically, the 1st algorithm wins in the average case but loses to the 2nd algorithm in the worst case.
An unordered_set has asymptotically the same time complexity as the 1st algorithm. In practice it is better and faster because you don't have the redundancy of the boolean values.

Upvotes: 3