Tarapararamus

Reputation: 41

Data structure for finding subsets that are present in a set

Suppose we have n subsets of the set of integers from 1 to N. For example,

N = 5, n = 3

s_1 = {1, 4}
s_2 = {2, 4, 5}
s_3 = {1, 5}

These subsets are fixed and never change (I know all of them in advance and have an arbitrary amount of time to precompute anything I need in the future).

There is a process that gets an arbitrary subset of the integers from 1 to N (not necessarily from the list above) and needs to return those subsets from the list above whose elements are all in the input. For example, for the input {1, 2, 3, 5} the output must be s_3, and for the input {1, 4, 5} the output is s_1, s_3.

I need an algorithm whose complexity does not depend on the number of subsets n (it is OK for it to depend on the size of the result). The expected values are N ~ 1 000 000 and n ~ 100 000; the average size of a subset is 10 (a constant known in advance) and the average size of a typical input is 30. The expected size of the result is 0-5 (almost always either 0 or 1).

For now, for every number from 1 to N I precompute an array that stores the subsets this number belongs to. When I get an input, for every number in it I look up those subsets and count how many values of each subset I have encountered so far. Since I know the size of every subset and all numbers in the input are unique, I can easily tell when all the numbers of a subset have been encountered.

The problem is that if, for example, some number is in every subset, then every time it appears in the input I have to iterate over all subsets, even though it is very unlikely that many of them will still be candidates by the time I finish consuming the whole input.
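For concreteness, here is a minimal sketch of this counting scheme in Python (the data and names are just illustrative):

from collections import defaultdict

subsets = {"s_1": {1, 4}, "s_2": {2, 4, 5}, "s_3": {1, 5}}

# Precompute: for every value, the subsets that contain it.
containing = defaultdict(list)
for name, s in subsets.items():
    for v in s:
        containing[v].append(name)

def query(values):
    seen = defaultdict(int)      # subset name -> elements of it encountered so far
    hits = []
    for v in set(values):        # input values are assumed unique
        for name in containing[v]:
            seen[name] += 1
            if seen[name] == len(subsets[name]):
                hits.append(name)
    return hits

print(query({1, 2, 3, 5}))   # ['s_3']
print(query({1, 4, 5}))      # ['s_1', 's_3'] (order may vary)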

Addition

A simpler task (which I am also interested in): find not all matching subsets, but any one subset whose elements are all in the input.

Upvotes: 4

Views: 246

Answers (4)

Alexander Nenashev

Reputation: 22769

You can build a tree whose nodes are the numbers of the subsets, with the set names stored at the leaves. Then you traverse the tree with the input and collect the subset names found at the leaves:

const input = [1, 4, 5];

const sets = {
  s_1:[1, 4],
  s_2:[2, 4, 5],
  s_3:[1, 5],
  s_4:[1, 4, 7],
};

const nodes = {};

for(const [name, set] of Object.entries(sets)){
  let node = nodes;
  for(const num of set){
    node = node[num] ??= {};
  }
  (node.sets ??= []).push(name);
}

console.log(JSON.stringify(nodes));
  
function findSets(input){  

  const traverse = (nodes, idx = 0) => {
    if(!nodes) return;
    nodes.sets && result.push(...nodes.sets);
    for(let i=idx;i<input.length;i++){
      traverse(nodes[input[i]], i+1);
    }
  }
  
  const result = [];
  traverse(nodes);
  return result;
  
}

console.log(...findSets([1, 4, 5]));
console.log(...findSets([1, 2, 3, 5]));
console.log(...findSets([1, 2, 4, 5]));
console.log(...findSets([2, 3, 4, 5]));

This works at the OP's data size (here we randomly select 3 subsets, merge them into an input, and find the matching subsets):

const n = 100000;
const N = 1000000;

const sets = Array.from({length:n}, () => Array.from({length:10}, () => Math.random()*N|0).sort((a,b)=>a-b));

const nodes = {};

for(const [name, set] of Object.entries(sets)){
  let node = nodes;
  for(const num of set){
    node = node[num] ??= {};
  }
  (node.sets ??= []).push(name);
}
  
function findSets(input){  

  const traverse = (nodes, idx = 0) => {
    if(!nodes) return;
    nodes.sets && result.push(...nodes.sets);
    for(let i=idx;i<input.length;i++){
      traverse(nodes[input[i]], i+1);
    }
  }

  const result = [];
  traverse(nodes);
  return result;
  
}

const input = sets[Math.random()*sets.length|0];
const input2 = sets[Math.random()*sets.length|0];
const input3 = sets[Math.random()*sets.length|0];
input.push(...input2), input.push(...input3);
input.sort((a,b) => a-b);

console.log('input:', ...input);

console.log('found sets (indices):', ...findSets(input));

Upvotes: 0

btilly

Reputation: 46399

Here is the Trie-based solution that I claimed existed yesterday.

It will depend only on the size k of the searched subset. Unfortunately we may try either including or not including each element, so that is O(2^k) possible nodes to visit. We do, however, only look at combinations that can lead to an actual stored set. But with a search set of size 30, that may still be a lot.

Therefore the max_value_for_depth stuff. That tries to cut off as many searches as you can near the leaf. It is a little counterintuitive, but doing more work to reduce the search space generally pays off. So I did that.

With the kind of data that you're talking about, this should perform pretty well.

(You can save a lot of memory if you get rid of the path attribute. But I had a minor bug, and having it really helped me track it down.)

class Trie:
    def __init__(self):
        # This is where we will put the sets.
        self.set = None

        # A lookup from next value to the Trie containing them.
        self.next_nodes = {}

        # What is the largest value that can match x more nodes?
        self.max_value_for_depth = []

        # Know where we are if we need to debug.
        self.path = []

    def add_subset (self, subset):
        curr = self
        sorted_subset = sorted(subset)
        # Just walk down the Trie, following a path to where we want.
        for i in range(len(sorted_subset)):
            if sorted_subset[i] not in curr.next_nodes:
                node = Trie()
                node.path = curr.path + [sorted_subset[i]]
                curr.next_nodes[sorted_subset[i]] = node
            depth = len(sorted_subset) - i
            while len(curr.max_value_for_depth) < depth:
                curr.max_value_for_depth.append(sorted_subset[i])
            j = depth - 1
            while -1 < j and curr.max_value_for_depth[j] <= sorted_subset[i]:
                curr.max_value_for_depth[j] = sorted_subset[i]
                j -= 1
            curr = curr.next_nodes[sorted_subset[i]]

        # And here we are.
        curr.set = subset

    def search_for_subsets (self, subset):
        # Use an explicit stack instead of recursion. Somewhat faster, and no depth limit.
        stack = [[self, 0]]
        subset = sorted(subset)
        while 0 < len(stack):
            curr, i = stack.pop()

            depth_left = len(subset) - i
            if 0 == depth_left:
                pass # Can't find another.
            elif depth_left < len(curr.max_value_for_depth) and curr.max_value_for_depth[depth_left] < subset[i]:
                pass # subset doesn't contain enough to match.
            else:
                # Try skipping this element.
                stack.append([curr, i+1])

                # Try including this element.
                if subset[i] in curr.next_nodes:
                    next_node = curr.next_nodes[subset[i]]
                    stack.append([next_node, i+1])

                    # And did we find a set?
                    if next_node.set:
                        yield next_node.set


    def count_subsets (self, subset):
        return len(list(self.search_for_subsets(subset)))

    # For debugging.
    def __repr__(self):
        return "Trie " + str(self.path)

t = Trie()
t.add_subset([1, 3, 4])
t.add_subset([1, 3, 5])
t.add_subset([3, 7])

for subset in t.search_for_subsets([1, 3, 5, 7]):
    print(subset)
print(t.count_subsets([1, 3, 5, 7]))

Upvotes: 1

Dillon Davis

Reputation: 7740

I propose that rather than recording a subset under every value it contains, you choose a single representative element and record the subset under only that element. With appropriate rebalancing, most entries in your array will have one or two affiliated subsets, even in the case where some element e is present in most subsets. You can then look up the values in each of these subsets and do a basic subset test against the input.
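A minimal sketch of this lookup, assuming for the moment that the representative is simply the smallest element of each subset (the rebalancing below would choose it more carefully):

from collections import defaultdict

subsets = [{1, 4}, {2, 4, 5}, {1, 5}]

# File each subset under exactly one of its elements.
by_rep = defaultdict(list)
for i, s in enumerate(subsets):
    by_rep[min(s)].append(i)

def query(values):
    values = set(values)
    out = []
    for v in values:                     # only groups whose representative is present
        for i in by_rep.get(v, []):
            if subsets[i] <= values:     # plain subset test against the input
                out.append(i)
    return out

print(query({1, 4, 5}))   # [0, 2], i.e. s_1 and s_3 (order may vary)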

To implement the rebalancing you have two options. You could pick a random element as the representative element in each subset, and put that subset into the group of subsets affiliated with that element. Then randomly pick a group amongst those with the most subsets, and attempt to randomly reassign one of its subsets to a different representative element (moving it to another group), but only if that new group is smaller than the current group. Repeating this process for enough steps, your groupings will tend towards an optimally balanced assignment.
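A rough sketch of that randomized rebalancing pass (the step count, seed, and tie-breaking are assumptions for illustration):

import random
from collections import defaultdict

def balance(subsets, steps=10_000, seed=0):
    rng = random.Random(seed)
    # Start from random representatives, then repeatedly try to shrink the largest group.
    rep = [rng.choice(sorted(s)) for s in subsets]
    groups = defaultdict(set)
    for i, r in enumerate(rep):
        groups[r].add(i)
    for _ in range(steps):
        worst = max(groups.values(), key=len)            # a group of maximal size
        i = rng.choice(sorted(worst))                     # a subset filed in it
        alt = rng.choice(sorted(subsets[i]))              # candidate new representative
        if len(groups[alt]) + 1 < len(groups[rep[i]]):    # only move into a strictly smaller group
            groups[rep[i]].remove(i)
            groups[alt].add(i)
            rep[i] = alt
    return rep

subsets = [{1, 4}, {2, 4, 5}, {1, 5}, {1, 2}, {1, 3}]
print(balance(subsets, steps=200))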

The alternative is that you could represent your subsets and values as a bipartite graph, subsets as vertices on the left, values as vertices on the right, and edges of capacity 1 between them accordingly. Then attach each subset vertex to a single source vertex S with edges of capacity 1, and attach the value vertices to a sink vertex T, each of these edges having a capacity corresponding to the number of subsets that value appears in. You can then run a max-flow algorithm over this graph, and non-zero flow through the middle edges (between subsets and values) identifies the representative elements. You can then re-run the max-flow algorithm, reducing the capacities of the edges connecting value vertices to the sink, to force more balanced assignments, until the max-flow no longer equals the total number of subsets. This will be much slower than the randomized approach, but would be deterministic and optimal.
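A rough sketch of one round of that flow formulation, using networkx as an assumed dependency (the balancing loop that lowers the sink capacities and re-runs is omitted):

import networkx as nx
from collections import Counter

subsets = [{1, 4}, {2, 4, 5}, {1, 5}, {1, 2}]
counts = Counter(v for s in subsets for v in s)

G = nx.DiGraph()
for i, s in enumerate(subsets):
    G.add_edge("S", ("set", i), capacity=1)              # source -> subset
    for v in s:
        G.add_edge(("set", i), ("val", v), capacity=1)   # subset -> value
for v, c in counts.items():
    G.add_edge(("val", v), "T", capacity=c)              # value -> sink; lower these to rebalance

flow_value, flow = nx.maximum_flow(G, "S", "T")
assert flow_value == len(subsets)   # every subset was assigned a representative

rep = {i: next(v for v in s if flow[("set", i)][("val", v)] > 0)
       for i, s in enumerate(subsets)}
print(rep)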

With those details out of the way, what's the new worst case? Lone frequent values are no longer a problem; it's isolated pools of values whose subsets contain no elements outside the pool. Imagine sampling a few hundred or a few thousand subsets from the power set of a small set of base elements.

What you can do to combat this pathological case, if it ends up being an issue in your particular subsets, is to repeat this process of picking representative elements a second time for each individual group independently. This won't increase your space requirements, should be much faster to preprocess, and will help quite a bit. You can (and should) also opt to do this only for groups that seemingly contain this pathological case (having a much higher than expected number of subsets); for the rest you can just do the basic subset tests one by one as before. This is because the split adds another factor of input_size to that portion of the search, so you want to make sure that it is pruning the subsets enough to warrant the cost.
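A rough sketch of that second-level split for one oversized group (the threshold and the choice of second representative are illustrative assumptions):

from collections import defaultdict

def split_group(group_subsets, first_rep, threshold=4):
    # Small groups stay as a flat list to be scanned with plain subset tests.
    if len(group_subsets) <= threshold:
        return group_subsets
    nested = defaultdict(list)
    for s in group_subsets:
        rest = s - {first_rep}
        second = min(rest) if rest else first_rep   # naive choice of second representative
        nested[second].append(s)
    return dict(nested)

group = [{1, 2}, {1, 3}, {1, 2, 3}, {1, 4}, {1, 2, 4}, {1, 3, 4}]
print(split_group(group, first_rep=1, threshold=4))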

The other thing you can do is store the actual subsets for a group in a trie data structure, where values are ordered by frequency within that subtree of the trie, to reduce repetition in your subset tests. For purely random subsets this wouldn't have any noticeable effect, but for real-life data and for pathological inputs, it can certainly speed things up.
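A small sketch of such a frequency-ordered trie for one group (ordering by global frequency within the group; data and structure are illustrative only):

from collections import Counter

group = [{1, 2, 3}, {1, 2, 4}, {1, 3}, {2, 3, 4}]
freq = Counter(v for s in group for v in s)

trie = {}
for s in group:
    node = trie
    # Most frequent values first, so common prefixes are shared near the root.
    for v in sorted(s, key=lambda v: (-freq[v], v)):
        node = node.setdefault(v, {})
    node["$"] = s   # marks a complete subset

def find(node, values, out):
    # Collect subsets whose elements are all contained in `values`.
    if "$" in node:
        out.append(node["$"])
    for v, child in node.items():
        if v != "$" and v in values:
            find(child, values, out)
    return out

print(find(trie, {1, 2, 3, 5}, []))   # [{1, 2, 3}, {1, 3}]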

Upvotes: 0

Matt Timmermans

Reputation: 59144

I think you're on the right track with tracking the remaining elements in each subset, but you can optimize by making a tree of subsets.

Let's say that N=10 and you have subsets {1,2,3}, {1,2,4}, and a bunch of others.

Find the 2 elements that appear together most often in a larger subset, then make a new element for their combination if such a subset doesn't already exist. We'll say that 10={1,2}, for example, and now our original subsets are {10,3} and {10,4}. Note that if the pair was used 3 or more times, the total number of element-in-subset relationships is reduced.

Continue this process until all of your subsets are size <= 2.

Now you can use your original marking/counting procedure, except that when a subset is found, you need to check whether it is itself an element in any other sets and mark those too.
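A rough sketch of the pair-merging preprocessing (greedy and unoptimized; a real implementation would maintain the pair counts incrementally). Composite element ids start at N+1 here, and each composite is then tracked by the counting procedure like a tiny subset of its own:

from collections import Counter
from itertools import combinations

def merge_pairs(subsets, N):
    subsets = [set(s) for s in subsets]
    composite = {}                      # new element id -> the pair it stands for
    next_id = N + 1
    while True:
        pairs = Counter()
        for s in subsets:
            if len(s) > 2:
                pairs.update(combinations(sorted(s), 2))
        if not pairs:
            break                       # every subset now has size <= 2
        (a, b), _ = pairs.most_common(1)[0]
        composite[next_id] = (a, b)
        for s in subsets:
            if len(s) > 2 and a in s and b in s:
                s -= {a, b}
                s.add(next_id)
        next_id += 1
    return subsets, composite

subsets = [{1, 2, 3}, {1, 2, 4}, {1, 2, 5, 6}]
print(merge_pairs(subsets, N=10))
# rewritten subsets all have size <= 2; composites e.g. {11: (1, 2), 12: (5, 6)}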

I haven't worked out exactly how to evaluate the improvement on the worst or average case that this optimization will provide, but it's extremely unlikely in practice that you'll end up with elements that are used in a large proportion of subsets.

Since your subsets are fixed, you can just try it and see how it goes.

Upvotes: 0
