Reputation: 71
I need to perform a lot of checks of whether the intersection of two sets (one is identical across all checks, the other one changes) is empty or not.
It's okay if the check says (in a small number of checks) that the intersection is not empty when it actually is, since there can be a second, more precise filtering step; so false positives are okay. It is not allowed that I filter out something that definitely has a non-empty intersection, so false negatives are not okay.
So, as an example scenario:
{A,B,C,D} <-> {D,E,F} => true
(D is in the intersection set; this must never return false)
{A,B,C} <-> {D,E,F} => false
(empty intersection set; this may also return true in a small number of checks)
For a single element I would use a Bloom filter, but I cannot find anything similar for a set of elements. Checking a Bloom filter element by element would be a possible option, but I'm searching for something better.
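For reference, a minimal Java sketch of that element-by-element baseline, assuming String elements, a plain java.util.BitSet as the Bloom bit array, and two ad-hoc hash functions (all choices are mine, for illustration only):

    import java.util.BitSet;
    import java.util.Set;

    public class ElementWiseBloomCheck {
        static final int M = 1 << 16; // bit-array size, chosen arbitrarily here

        // Build a Bloom filter (k = 2 hash functions) for the fixed set once.
        static BitSet buildFilter(Set<String> fixed) {
            BitSet filter = new BitSet(M);
            for (String e : fixed) {
                int h = e.hashCode();
                filter.set((h & 0x7fffffff) % M);
                filter.set(((h * 31 + 17) & 0x7fffffff) % M); // second, derived hash
            }
            return filter;
        }

        // Element-by-element check: false positives possible, false negatives not.
        static boolean mayIntersect(BitSet filter, Set<String> changing) {
            for (String e : changing) {
                int h = e.hashCode();
                if (filter.get((h & 0x7fffffff) % M)
                        && filter.get(((h * 31 + 17) & 0x7fffffff) % M)) {
                    return true; // e is possibly in the fixed set
                }
            }
            return false; // definitely no common element
        }
    }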
Upvotes: 5
Views: 970
Reputation: 71
Thanks a lot for your answers; they helped me come up with a good solution and solve the problem.
The idea is really primitive, but sufficient.
I create two bitsets, one for the changing set and one for the fixed set. Each element of a set is hashed to one bit (so e.g. for a long, one of the bits 1 to 64), and the bits are then OR-ed together per set (essentially a Bloom filter bitset with k=1).
To check whether there is a non-empty intersection, I simply combine the two bitsets with a bitwise AND and check whether the result is not 0.
The false-positive rate is (I think; I didn't do the math) worse than a regular Bloom filter's, but it's good enough for my case.
Example:
[A,B,C] => 0000100010100000
[B,D,F] => 0100000010000100
---------------------- &
0000000010000000 != 0 => true
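A minimal Java sketch of this, assuming String elements and hashCode() masked down to one of 64 bit positions (class and method names are mine):

    import java.util.Set;

    public class IntersectionPrefilter {
        // Hash each element to one of 64 bits and OR them into a single long:
        // a Bloom bitset with k = 1 and m = 64.
        static long signature(Set<String> set) {
            long bits = 0L;
            for (String e : set) {
                bits |= 1L << (e.hashCode() & 63); // pick one of 64 bit positions
            }
            return bits;
        }

        // A shared element sets the same bit in both signatures, so this can
        // produce false positives (hash collisions) but never false negatives.
        static boolean mayIntersect(long fixedSig, long changingSig) {
            return (fixedSig & changingSig) != 0;
        }

        public static void main(String[] args) {
            long fixed = signature(Set.of("A", "B", "C", "D"));
            long changing = signature(Set.of("D", "E", "F"));
            System.out.println(mayIntersect(fixed, changing)); // true (D is shared)
        }
    }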
Upvotes: 2
Reputation: 11406
One optimization would be to keep a list (an array, for fast lookup) with the min/max values for each set. Then first check in that list whether the ranges overlap. If not -> return false; no further checks needed.
S1: a b c
S2: d e f
S1 and S2 -> false (no overlap)
If the sets are sorted and they do overlap, you only have to check the overlapping region.
S1: a b c d e
S2: d e f g h
Only check the 'd e' region
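A possible Java sketch of these two steps (min/max rejection, then a merge-style scan restricted to the overlapping region), assuming the sets are kept as sorted int arrays:

    public class RangeCheck {
        // Quick rejection: the sets cannot intersect if their
        // [min, max] ranges do not overlap.
        static boolean rangesOverlap(int[] a, int[] b) {
            return a[0] <= b[b.length - 1] && b[0] <= a[a.length - 1];
        }

        static boolean intersects(int[] a, int[] b) {
            if (!rangesOverlap(a, b)) return false; // cheap early exit
            int lo = Math.max(a[0], b[0]);
            int hi = Math.min(a[a.length - 1], b[b.length - 1]);
            int i = 0, j = 0;
            while (i < a.length && j < b.length) {
                if (a[i] < lo) { i++; continue; }       // skip below the region
                if (b[j] < lo) { j++; continue; }
                if (a[i] > hi || b[j] > hi) return false; // left the region
                if (a[i] == b[j]) return true;
                if (a[i] < b[j]) i++; else j++;
            }
            return false;
        }

        public static void main(String[] args) {
            System.out.println(intersects(new int[] {1, 2, 3, 4, 5},
                                          new int[] {4, 5, 6, 7, 8})); // true
            System.out.println(intersects(new int[] {1, 2, 3},
                                          new int[] {4, 5, 6})); // false: no overlap
        }
    }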
If you need to check the intersection of more than 2 sets, first try to find two non-overlapping sets. If found -> return false. If not, only check the overlapping region for all those sets (which should get smaller with more sets).
S1: a b c d e
S2: d e f g h
S3: g h i
S1 and S2 and S3 -> false (S1 and S3 do not overlap)
If most of the sets span a wide range, you could use another option:
Let's say the maximum number of elements is 6400 (for this example), and every element is, or can be converted to, an integer in 1-6400.
For each set a small bitmap (64 bit unsigned integer) could be created, with one bit representing 100 items.
For example:
S1: 1,23,80,312,340
S2: 160,184,450
S3: 230,250,340
S1 bitmap: 100100..
S2 bitmap: 010010..
S3 bitmap: 001100..
S1 and S2 -> false
S1 and S3 -> true, only check range 301-400
S1 and S2 and S3 -> false
You could of course use a smaller number than 100 (preferably a power of 2, so you can quickly set the corresponding bit) and use multiple uint64s.
This could even be done on multiple levels (depending on the amount of memory / storage space you're willing to use). For example, first do a really quick check on one 64-bit integer (it takes one CPU cycle and can easily be done in SQL). Only for those that match, check a second level containing maybe 4, 8 or 16 uint64s, with each bit representing a smaller range of values (this could also be really fast using SSE/AVX registers). If they still match, do a deeper check, but only for the ranges corresponding to the set bits in the result.
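A rough Java sketch of the single-level bitmap, using buckets of 128 values instead of 100 (a power of two, as suggested above, so the bucket index is a cheap shift; 64 bits * 128 values then covers 0-8191):

    public class CoarseBitmap {
        // One bit per bucket of 128 consecutive values.
        static long bitmap(int[] elements) {
            long bits = 0L;
            for (int e : elements) {
                bits |= 1L << (e >> 7); // e >> 7 == e / 128
            }
            return bits;
        }

        public static void main(String[] args) {
            long s1 = bitmap(new int[] {1, 23, 80, 312, 340});
            long s2 = bitmap(new int[] {160, 184, 450});
            long s3 = bitmap(new int[] {230, 250, 340});
            System.out.println((s1 & s2) != 0); // false: no bucket in common
            System.out.println((s1 & s3) != 0); // true: check the shared bucket
            System.out.println((s2 & s3) != 0); // true, but a false positive;
                                                // the deeper check rules it out
        }
    }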
Upvotes: 0
Reputation: 3551
You mentioned you're doing it in SQL. So we have something like this:
PatternSet (ElemId int16 primary key): the first table, with the set to check against
ProbablyChangedSets (ElemId int16, SetId int, primary key(ElemId, SetId)): the second table, containing the sets to check against PatternSet
I'm curious: is this query's performance just not enough?
-- sets with intersections
select distinct
    cs.SetId
from ProbablyChangedSets cs
join PatternSet s
    on cs.ElemId = s.ElemId
-- |cs| = setCount * avgSetSize = 10^8 * 10 = 10^9
-- |s| = avgSetSize = 10
-- numberOfComparisons ~= 10^9 * 10 = 10^10, comparisonComplexity = O(1)
With enough parallelization it'd be very fast - a couple of seconds.
Or are your checks sequential, and you need to optimize the single check operation?
Upvotes: 0