mitchus
mitchus

Reputation: 4877

Most common subset of size k

Suppose you have a list of subsets S1,...,Sn of the integer range R={1,2,...,N}, and an integer k. Is there an efficient way to find a subset C of R of size k such that C is a subset of a maximal number of the Si?

As an example, let R={1,2,3,4} and k=2

S1={1,2,3}
S2={1,2,3}
S3={1,2,4}
S4={1,3,4}

Then I want to return either C={1,2} or C={1,3} (doesn't matter which).

Upvotes: 7

Views: 1713

Answers (3)

CapelliC
CapelliC

Reputation: 60034

I hope I don't misunderstand the problem... Here a solution in SWI-Prolog

:- module(subsets, [solve/0]).
:- [library(pairs),
    library(aggregate)].

solve :-
    problem(R, K, Subsets),
    once(subset_of_maximal_number(R, K, Subsets, Subset)),
    writeln(Subset).

problem(4, 2,
[[1,2,3], [1,2,3], [1,2,4], [1,3,4]]).

problem(8, 3,
[[1, 3, 4, 6], [2, 6, 7, 8], [3, 5, 6, 7], [2, 4, 6, 7], [1, 4, 5, 8],
 [2, 4, 6, 8], [1, 2, 3, 8], [1, 6, 7, 8], [1, 2, 4, 7], [1, 2, 5, 7]]).

subset_of_maximal_number(R, K, Subsets, Subset) :-
    flatten(Subsets, Numbers),
    findall(Num-Count,
        (   between(1, R, Num),
            aggregate_all(count, member(Num, Numbers), Count)
        ), NumToCount),
    transpose_pairs(NumToCount, CountToNumSortedR),
    reverse(CountToNumSortedR, CountToNumSorted),
    length(Subset, K), % list of free vars
    prefix(SolutionsK, CountToNumSorted),
    pairs_values(SolutionsK, Subset).

test output:

?- solve.
[1,3]
true ;
[7,6,2]
true.

edit: I think that the above solution is wrong, in the sense that what's returned could not be a subset of any of the input: here (a commented) solution without this problem:

:- module(subsets, [solve/0]).
:- [library(pairs),
    library(aggregate),
    library(ordsets)].

solve :-
    problem(R, K, Subsets),
    once(subset_of_maximal_number(R, K, Subsets, Subset)),
    writeln(Subset).

problem(4, 2,
[[1,2,3], [1,2,3], [1,2,4], [1,3,4]]).

problem(8, 3,
[[1, 3, 4, 6], [2, 6, 7, 8], [3, 5, 6, 7], [2, 4, 6, 7], [1, 4, 5, 8],
 [2, 4, 6, 8], [1, 2, 3, 8], [1, 6, 7, 8], [1, 2, 4, 7], [1, 2, 5, 7]]).

subset_of_maximal_number(R, K, Subsets, Subset) :-
    flatten(Subsets, Numbers),
    findall(Num-Count,
        (   between(1, R, Num),
            aggregate_all(count, member(Num, Numbers), Count)
        ), NumToCount),

    % actually sort by ascending # of occurrences
    transpose_pairs(NumToCount, CountToNumSorted),
    pairs_values(CountToNumSorted, PreferredRev),

    % we need higher values first
    reverse(PreferredRev, Preferred),

    % empty slots to fill, preferred first
    length(SubsetP, K),
    select_k(Preferred, SubsetP),

    % verify our selection it's an actual subset of any of subsets
    sort(SubsetP, Subset),
    once((member(S, Subsets), ord_subtract(Subset, S, []))).

select_k(_Subset, []).
select_k(Subset, [E|R]) :-
    select(E, Subset, WithoutE),
    select_k(WithoutE, R).

test:

?- solve.
[1,3]
true ;
[2,6,7]
true.

Upvotes: 1

Edouard
Edouard

Reputation: 1543

I think your problem is NP-Hard. Consider the bipartite graph with the left nodes being your sets and the right nodes being the integers {1, ..., N}, with an edge between two nodes if the set contains the integer. Then, finding a common subset of size k, which is a subset of a maximal number of the Si, is equivalent to finding a complete bipartite subgraph K(i, k) with maximal number of edges i*k. If you could do this in polynomial time, then, you could find the complete bipartite subgraph K(i, j) with maximal number of edges i*j in polynomial time, by trying for each fixed k. But this problem in NP-Complete (Complete bipartite graph).

So, unless P=NP, your problem does not have a polynomial time algorithm.

Upvotes: 2

Mr.Wizard
Mr.Wizard

Reputation: 24336

Assuming I understand your question I believe this is straightforward for fairly small sets.

I will use Mathematica code for illustration, but the concept is universal.

I generate 10 random subsets of length 4, from the set {1 .. 8}:

ss = Subsets[Range@8, {4}] ~RandomSample~ 10
{{1, 3, 4, 6}, {2, 6, 7, 8}, {3, 5, 6, 7}, {2, 4, 6, 7}, {1, 4, 5, 8},
 {2, 4, 6, 8}, {1, 2, 3, 8}, {1, 6, 7, 8}, {1, 2, 4, 7}, {1, 2, 5, 7}}

I convert these to a binary array of the presence of each number in each subset:

a = Normal@SparseArray[Join @@ MapIndexed[Tuples[{##}] &, ss] -> 1];

Grid[a]

Mathematica graphics

That is ten columns for ten subsets, and eight rows for elements {1 .. 8}.

Now generate all possible target subsets (size 3):

keys = Subsets[Union @@ ss, {3}];

Take a "key" and extract those rows from the array and do a BitAnd operation (return 1 iff all columns equal 1), then count the number of ones. For example, for key {1, 6, 8} we have:

a[[{1, 6, 8}]]

Mathematica graphics

After BitAnd:

Mathematica graphics

Do this for each key:

counts = Tr[BitAnd @@ a[[#]]] & /@ keys;

Then find the position(s) of the maximum element of that list, and extract the corresponding parts of keys:

keys ~Extract~ Position[counts, Max@counts]
{{1, 2, 7}, {2, 4, 6}, {2, 4, 7}, {2, 6, 7}, {2, 6, 8}, {6, 7, 8}}

With adequate memory this process works quickly for a larger set. Starting with 50,000 randomly selected subsets of length 7 from {1 .. 30}:

ss = Subsets[Range@30, {7}] ~RandomSample~ 50000;

The maximum sub-subsets of length 4 are calculated in about nine seconds:

AbsoluteTiming[
  a = Normal@SparseArray[Join @@ MapIndexed[Tuples[{##}] &, ss] -> 1];
  keys = Subsets[Union @@ ss, {4}];
  counts = Tr[BitAnd @@ a[[#]]] & /@ keys;
  keys~Extract~Position[counts, Max@counts]
]
 {8.8205045, {{2, 3, 4, 20},
              {7, 10, 15, 18},
              {7, 13, 16, 26},
              {11, 21, 26, 28}}}

I should add that Mathematica is a high level language and these operations are on generic objects, therefore if this is done truly at the binary level this should be much faster, and more memory efficient.

Upvotes: 2

Related Questions