Sandeep
Sandeep

Reputation: 178

Faster way of searching array of sets

I have an array containing 100,000 sets. Each set contains natural numbers below 1,000,000. I have to find the number of ordered pairs {m, n}, where 0 < m < 1,000,000, 0 < n < 1,000,000 and m != n, which do not exist together in any of 100,000 sets. A naive method of searching through all the sets leads to 10^5 * (10^6 choose 2) number of searches.

For example I have 2 sets set1 = {1,2,4} set2 = {1,3}. All possible ordered pairs of numbers below 5 are {1,2}, {1,3}, {1,4}, {2,3}, {2,4} and {3,4}. The ordered pairs of numbers below 5 which do not exist together in set 1 are {1,3},{2,3} and {3,4}. The ordered pairs below 5 missing in set 2 are {1,2},{1,4},{2,3},{2,4} and {3,4}. The ordered pairs which do not exist together in both the sets are {2,3} and {3,4}. So the count of number of ordered pairs missing is 2.

Can anybody point me to a clever way of organizing my data structure so that finding the number of missing pairs is faster? I apologize in advance if this question has been asked before.

Update: Here is some information about the structure of my data set. The number of elements in each set varies from 2 to 500,000. The median number of elements is around 10,000. The distribution peaks around 10,000 and tapers down in both direction. The union of the elements in the 100,000 sets is close to 1,000,000.

Upvotes: 14

Views: 712

Answers (6)

tamas.kenez
tamas.kenez

Reputation: 7789

Physically storing each possible pair would take too much memory. We have 100k sets and an average set has 10k numbers = 50M pairs = 400MB with int32 (and set<pair<int, int>> needs much more than 8 bytes per element).

My suggestion is based on two ideas:

  • don't store, only count the missing pairs
  • use interval set for compact storage and fast set operations (like boost interval set)

The algorithm is still quadratic on the number of elements in the sets but needs much less space.

Algorithm:

  • Create the union_set of the individual sets.
  • We also need a data structure, let's call it sets_for_number to answer this question: which sets contain a particular number? For the simplest case this could be unordered_map<int, vector<int>> (vector stores set indices 0..99999)
  • Also create the inverse sets for each set. Using interval sets this takes only 10k * 2 * sizeof(int) space per set on average.

    dynamic_bitset<> union_set = ...; //union of individual sets (can be vector<bool>)
    vector<interval_set<int>> inverse_sets = ...; // numbers 1..999999 not contained in each set
    int64_t missing_count = 0;
    
    for(int n = 1; n < 1000000; ++n)
        // count the missing pairs whose first element is n
        if (union_set.count(n) == 0) {
            // all pairs are missing
            missing_count += (999999 - n);
        } else {
            // check which second elements are not present
            interval_set<int> missing_second_elements = interval_set<int>(n+1, 1000000);
            // iterate over all sets containing n
            for(int set_idx: sets_for_number.find(n)) {
               // operator&= is in-place intersection
               missing_second_elements &= inverse_sets[set_idx];
            }
            // counting the number of pairs (n, m) where m is a number
            // that is not present in any of the sets containing n
            for(auto interval: missing_second_elements)
                missing_count += interval.size()
        }
    }
    

Upvotes: 1

featuredpeow
featuredpeow

Reputation: 2201

First lets solve more simple task of counting number of elements not present in your sets. This task can be reworded in more simple form - instead of 100,000 sets you can think about 1 set which contains all your numbers. Then number of elements not present in this set is x = 1000000 - len(set). Now you can use this number x to count number of combinations. With repetitions: x * x, without repetitions: x * (x - 1). So bottom line of my answer is to put all your numbers in one big set and use it's length to find number of combinations using combinatorics.

Update

So above we have a way to find number of combinations where each element in combination is not in any of the sets. But question was to find number of combinations where each combination is not present in any of the sets.

Lets try to solve simpler problem first:

  • your sets have all numbers in them, none missing
  • each number is present exactly in one set, no duplicates across sets

How you would construct such combinations over such sets? You would simply pick two elements from different sets and resulting combination would not be in any of the sets. Number of such combinations could be counted using following code (it accepts sizes of the sets):

int count_combinations(vector<int>& buckets) {
  int result = 0;
  for (int i=0; i < buckets.size(); ++i) {
    for (int j=i+1; j < buckets.size(); ++j) {
      result += buckets[i] * buckets[j];
    }
  }
  return result;
}

Now let's imagine that some numbers are missing. Then we can just add additional set with those missing numbers to our sets (as a separate set). But we also need to account that given there were n missing numbers there would be n * (n-1) combinations constructed using only these missing numbers. So following code will produce total number of combinations with account to missing numbers:

int missing_numbers = upper_bound - all_numbers.size() - 1;
int missing_combinations = missing_numbers * (missing_numbers - 1);

return missing_combinations + count_combinations(sets, missing_numbers); 

Now lets imagine we have a duplicate across two sets: {a, b, c}, {a, d}. What types of errors they will introduce? Following pairs: {a, a} - repetition, {a, d} - combination which is present in second set. So how to treat such duplicates? We need to eliminate them completely from all sets. Even single instance of a duplicate will produce combination present in some set. Because we can just pick any element from the set where duplicate was removed and produce such combination (in my example - if we will keep a in first set, then pick d from the second to produce {a, d}, if we will keep a in second set, then pick b or c from the first to produce {a, b} and {a, c}). So duplicates shall be removed.

Update

However we can't simply remove all duplicates, consider this counterexample: {a, b} {a, c} {d}. If we simply remove a we will acquire {b} {c} {d} and lost information about not-existing combination {a, d}. Consider another counterexample: {a, b} {a, b, c} {b, d}. If we simply remove duplicates we will acquire {c} {d} and lost information about {a, d}.

Also we can't simply apply such logic to pairs of sets, a simple counter example for numbers < 3: {1, 2} {1} {2}. Here number of missing combinations is 0, but we will incorrectly count in {1, 2} if we will apply duplicates removal to pair of sets. Bottom line is that I can't come up with good technique which will help to correctly handle duplicate elements across sets.

Upvotes: 3

Adam Martin
Adam Martin

Reputation: 1218

What you can do, depending on memory requirements, is take advantage of the ordering of Set, and iterate over the values smartly. Something like the code below (untested). You'll iterate over all of your sets, and then for each of your sets you'll iterate over their values. For each of these values, you'll check all of the values in the set after them. Our complexity is reduced to the number of sets times the square of their sizes. You can use a variety of methods to keep track of your found/unfound count, but using a set should be fine, since insertion is simply O(log(n)) where n is no more than 499999500000. In theory using a map of sets (mapping based on the first value) could be slightly faster, but in either case the cost is minimal.

long long numMissing(const std::array<std::set<int>, 100000>& sets){
    std::set<pair<int, int> > found;
    for (const auto& s : sets){
        for (const auto& m : s){
            const auto &n = m;
            for (n++; n != s.cend(); n++){
                found.emplace(m, n);
            }
        }
    }
    return 499999500000 - found.size();
}

Upvotes: 2

user1726343
user1726343

Reputation:

If you are looking for combinations across sets, there is a way to meaningfully condense your dataset, as shown in frenzykryger's answer. However, from your examples, what you're looking for is the number of combinations available within each set, meaning each set contains irreducible information. Additionally, you can't use combinatorics to simply obtain the number of combinations from each set either; you ultimately want to deduplicate combinations across all sets, so the actual combinations matter.

Knowing all this, it is difficult to think of any major breakthroughs you could make. Lets say you have i sets and a maximum of k items in each set. The naive approach would be:

  • If your sets are typically dense (i.e. contain most of the numbers between 1 and 1,000,000), replace them with the complement of the set instead
  • Create a set of 2 tuples (use a set structure that ensures insertion is idempotent)
  • For each set O(i):
    • Evaluate all combinations and insert into set of combinations: O(k choose 2)

The worst case complexity for this isn't great, but assuming you have scenarios where a set either contains most of the numbers between 0 and 1,000,000, or almost none of them, you should see a big improvement in performance.

Another approach would be to go ahead and use combinatorics to count the number of combinations from each set, then use some efficient approach to find the number of duplicate combinations among sets. I'm not aware of such an approach, but it is possible it exists.

Upvotes: 3

Steve Lee
Steve Lee

Reputation: 1

If it is possible, have a set of all numbers and remove each of the number when you insert to your array of set. This will have a O(n) space complexity.

Of course if you don't want to have high spec complexity, maybe you can have a range vector. For each element in the vector, you have a pair of numbers which are the start/end of a range.

Upvotes: 0

serhiyb
serhiyb

Reputation: 4833

As an option you can build Bloom Filter(s) over your sets.

Before checking against all sets you can quickly lookup at your bloom filter and since it will never produce false negatives you can safely use your pair as its not present in your sets.

Upvotes: 1

Related Questions