Diego Tejada
Diego Tejada

Reputation: 41

Finding combination of subsets with largest intersection

Suppose I have a list of n sets, is there an efficient way of computing the combination of r sets whose intersection is larger than any other combination of r sets in the list?

In particular, I have a list of about 6000 sets of strings and I want to choose about 9 sets from this list such that they share the most strings out of all combinations of 9 sets. The problem is that if I were to brute-force it and compute all of the combinations and look for the max intersection it would take (6000 choose 9) ~ 3e28 computations, so I need either a more efficient algorithm or some reliable heuristic.

In addition, I would like to extend this question, if possible, to choosing a variable r such that the total element size of any combination is less than some arbitrary threshold, as opposed to choosing a constant r. That is, instead of just choosing 9 sets from the list of 6000 to make a combination, the algorithm would add sets of strings until the total number of strings in the combination exceeds some threshold, say 40 strings.
This is a lot closer to what I originally wanted to do, but I realized that taking constant size combinations of 9 sets works somewhat decently for the list I'm working with and would probably be a lot easier to implement, although an algorithm that could accomplish this would be preferable. From what I can gather, this problem is similar to the knapsack problem, although the only efficient way I know of solving that is with dynamic programming and I'm not sure how I would implement dynamic programming in this case given that I have to compute the running intersection to get the weights instead of having a pre-computed list of weights as with regular knapsack.

Upvotes: 4

Views: 243

Answers (1)

user3386109
user3386109

Reputation: 34829

Here's an idea for you:

Choose 9 sets at random from the 6000. For the remaining 5991 sets, determine which of the 9 sets to replace to maximize the intersection (or don't use the new set if it provides no improvement). That's roughly 6000 * 9 operations.

Do this K times, keeping track of the answers. Then find the set that occurs in most answers (the winning set). Repeat the whole process, but now the randomly chosen starting set must include the winning set, and the winning set may not be replaced.

Repeat until 8 of the 9 sets are winners. Then the 9th set is whatever set maximizes the intersection. You could also terminate early if all the answers are identical.

The run time is roughly 6000 * 9 * K * 8 operations, where one operation computes the intersection of 9 sets.

Upvotes: 1

Related Questions