Reputation: 12507
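I have two sets (or maps) and need to efficiently handle their intersection. I know that there are two ways of doing this:
- iterate over both maps in step, as std::set_intersection does: O(n1 + n2)
- iterate over one map and find each element in the other: O(n1 log n2)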
Depending on the sizes, one of these two solutions is significantly better (I have timed it), so I need to either switch between the algorithms based on the sizes (which is a bit messy) or find a solution that outperforms both, e.g. some variant of map.find() that takes the previous iterator as a hint (similar to map.emplace_hint(...)). I could not find such a function.
Question: Is it possible to combine the performance characteristics of the two solutions directly using STL - or some compatible library?
Note that the performance requirement makes this different from earlier questions such as Efficient intersection of sets?
Upvotes: 5
Views: 2222
Reputation: 59184
For sets that are implemented as binary trees, there actually is an algorithm that combines the benefits of both the procedures you mention. Essentially, you do a merge like std::set_intersection, but while iterating in one tree, you skip any branches that are all less than the current value in the other.
The resulting intersection takes O(min(n1 log n2, n2 log n1, n1 + n2)), which is just what you want.
Unfortunately, I'm pretty sure std::set doesn't provide interfaces that could support this operation.
I've done it a few times in the past though, when working on joining inverted indexes and similar things. Usually I make iterators with a skipTo(x) operation that will advance to the next element >= x. To meet my promised complexity it has to be able to skip N elements in log(N) amortized time. Then an intersection looks like this:
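// Sketch only: set<T> stands for a custom ordered container whose iterators
// have a skipTo(x) member that advances to the first element >= x.
// std::set iterators do not provide this.
template <typename T>
void get_intersection(vector<T> *dest, const set<T> &set1, const set<T> &set2)
{
    auto end1 = set1.end();
    auto end2 = set2.end();
    auto it1 = set1.begin();
    if (it1 == end1)
        return;
    auto it2 = set2.begin();
    if (it2 == end2)
        return;
    for (;;)
    {
        it1.skipTo(*it2);
        if (it1 == end1)
            break;
        if (*it1 == *it2)
        {
            dest->push_back(*it1);
            ++it1;
            if (it1 == end1) // do not dereference end below
                break;
        }
        it2.skipTo(*it1);
        if (it2 == end2)
            break;
        if (*it2 == *it1)
        {
            dest->push_back(*it2);
            ++it2;
            if (it2 == end2) // do not dereference end on the next pass
                break;
        }
    }
}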
It easily extends to an arbitrary number of sets using a vector of iterators, and pretty much any ordered collection can be extended to provide the iterators required -- sorted arrays, binary trees, b-trees, skip lists, etc.
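For sorted arrays, one way to get a skip operation with the required cost is galloping (exponential) search followed by a binary search. The following is a minimal sketch, not the code I used: it assumes random-access iterators over sorted, duplicate-free data, and skip_to and intersect_sorted are hypothetical names rather than standard library functions.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: return the first position in [first, last) whose
// element is >= x. Gallops forward in doubling steps, then binary-searches
// the bracketed range, so skipping k elements costs O(log k) comparisons.
template <typename It, typename T>
It skip_to(It first, It last, const T& x) {
    It lo = first;               // invariant: every element before lo is < x
    std::ptrdiff_t step = 1;
    while (last - lo >= step && *(lo + (step - 1)) < x) {
        lo += step;              // the whole block [lo, lo + step) was < x
        step *= 2;
    }
    It hi = (last - lo >= step) ? lo + step : last;
    return std::lower_bound(lo, hi, x);
}

// Intersection of two sorted, duplicate-free vectors using skip_to.
template <typename T>
std::vector<T> intersect_sorted(const std::vector<T>& a, const std::vector<T>& b) {
    std::vector<T> out;
    auto it1 = a.begin(), end1 = a.end();
    auto it2 = b.begin(), end2 = b.end();
    while (it1 != end1 && it2 != end2) {
        it1 = skip_to(it1, end1, *it2);
        if (it1 == end1)
            break;
        if (*it1 == *it2) {      // common element: record it, advance both
            out.push_back(*it1);
            ++it1;
            ++it2;
            continue;
        }
        it2 = skip_to(it2, end2, *it1);  // here *it1 > *it2
    }
    return out;
}
```

When one input is much smaller, most of the work is a handful of long skips in the larger input; when the inputs are similar in size, each skip is short and the loop behaves like a plain merge.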
Upvotes: 3
Reputation: 1219
With regard to the performance requirement, O(n1 + n2) is in most circumstances a very good complexity, so an alternative is only worth considering if you're doing this calculation in a tight loop.
If you really do need it, the combination approach isn't too bad, perhaps something like this:
Pseudocode:
x' = set_with_min_length([x, y])
y' = set_with_max_length([x, y])
if (x'.length * log(y'.length)) <= (x'.length + y'.length):
    return iterate_over_map_find_elements_in_other(y', x')
return std::set_intersection(x, y)
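A minimal C++ sketch of that pseudocode, assuming std::set inputs. The names adaptive_intersection and intersect_by_find are hypothetical, and the cost test ignores constant factors, so the real crossover point should be measured rather than trusted:

```cpp
#include <algorithm>
#include <cmath>
#include <iterator>
#include <set>
#include <vector>

// Look up each element of the smaller set in the larger one: O(n1 log n2).
template <typename T>
void intersect_by_find(const std::set<T>& smaller, const std::set<T>& larger,
                       std::vector<T>& out) {
    for (const T& v : smaller)
        if (larger.find(v) != larger.end())
            out.push_back(v);
}

// Pick whichever approach the rough cost estimate favours.
template <typename T>
std::vector<T> adaptive_intersection(const std::set<T>& x, const std::set<T>& y) {
    const std::set<T>& smaller = x.size() <= y.size() ? x : y;
    const std::set<T>& larger  = x.size() <= y.size() ? y : x;
    std::vector<T> out;
    if (smaller.empty())
        return out;
    const double n1 = static_cast<double>(smaller.size());
    const double n2 = static_cast<double>(larger.size());
    if (n1 * std::log2(n2) <= n1 + n2)
        intersect_by_find(smaller, larger, out);
    else  // linear merge: O(n1 + n2)
        std::set_intersection(x.begin(), x.end(), y.begin(), y.end(),
                              std::back_inserter(out));
    return out;
}
```

Both branches produce the result in ascending order, so callers do not need to know which one ran.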
I don't think you'll find an algorithm that will beat either of these complexities, but I'm happy to be proven wrong.
Upvotes: 0
Reputation: 65458
I don't know how to do this using the standard library, but if you wrote your own balanced binary search tree, here is how to implement a limited "find with hint". (Depending on your other requirements, a BST reimplementation could also leave out the parent pointers, which could be a performance win over the STL.)
Assume that the hint value is less than the value to be found and that we know the stack of ancestors of the hint node to whose left sub-tree the hint node belongs. First search normally in the right sub-tree of the hint node, pushing nodes onto the stack as warranted (to prepare the hint for next time). If this doesn't work, then while the stack's top node has a value that is less than the query value, pop the stack. Search from the last node popped (if any), pushing as warranted.
I claim that, when using this mechanism to search successively for values in ascending order, (1) each tree edge is traversed at most once, and (2) each find traverses the edges of at most two descending paths. Given 2*n1 descending paths in a binary tree with n2 nodes, the cost of the edges is O(n1 log n2). It's also O(n2), because each edge is traversed once.
Upvotes: 0
Reputation: 2355
In almost every case std::set_intersection will be the best choice.
The other solution may be better only if the sets contain a very small number of elements, due to the nature of the base-two logarithm, which scales as:
n = 2, log(n)= 1
n = 4, log(n)= 2
n = 8, log(n)= 3
.....
n = 1024, log(n)= 10
O(n1*log(n2)) is significantly more expensive than O(n1 + n2) if the length of the sets is more than 5-10 elements.
There is a reason such a function was added to the STL and implemented that way. Using it will also make the code more readable.
Selection sort is faster than merge or quick sort for collections with length less than 20 but is rarely used.
Upvotes: 4