Reputation: 1
I am facing a problem where for a number of words, I make a call to a HashMultimap (Guava) to retrieve a set of integers. The resulting sets have, say, 10, 200 and 600 items respectively. I need to compute the intersection of these three (or four, or five...) sets, and I need to repeat this whole process many times (I have many sets of words). However, what I am experiencing is that on average these set intersections take so long to compute (from 0 to 300 ms) that my program takes a very long time to complete if I look at hundreds of thousands of sets of words.
Is there any substantially quicker method to achieve this, especially given I'm dealing with (sortable) integers?
Thanks a lot!
Upvotes: 0
Views: 3049
Reputation: 11
The post http://www.censhare.com/en/aktuelles/censhare-labs/yet-another-compressed-bitset describes an implementation of an ordered primitive long set with set operations (union, minus and intersection). To my experience it's quite efficient for dense or sparse value populations.
Upvotes: 0
Reputation: 729
One problem with a bitmap based solution is that even if the sets themselves are very small, but contain very large numbers (or even unbounded) checking bitmaps would be very wasteful.
A different approach would be, for example, sorting the two sets, merging them and checking for duplicates. This can be done in O(nlogn) time complexity and extra O(n) space complexity, given set sizes are O(n).
You should choose the solution that matches your problem description (input range, expected set sizes, etc.).
Upvotes: 1
Reputation: 8392
If you are able to represent your sets as arrays of bits (bitmaps), you can intersect them with AND operations. You could even implement this to run in parallel.
As an example (using jlordo's question): if set1 is {1,2,4} and set2 is {1,2,5}
Then your first set would be represented as: 00010110 (bits set for 1, 2, and 4). Your second set would be represented as: 00100110 (bits set for 1, 2, and 5).
If you AND them together, you get: 00000110 (bits set for 1 and 2)
Of course, if you had a larger range of integers, then you will need more bytes. The beauty of bitmap indexes is that they take just one bit per possible element, thus occupying a relatively small space.
In Java, for example, you could use the BitSet data structure (not sure if it can do operations in parallel, though).
Upvotes: 7