Reputation: 4877
Suppose you have a list of subsets S1,...,Sn
of the integer range R={1,2,...,N}
, and an integer k
. Is there an efficient way to find a subset C
of R
of size k
such that C
is a subset of a maximal number of the Si
?
As an example, let R={1,2,3,4}
and k=2
S1={1,2,3}
S2={1,2,3}
S3={1,2,4}
S4={1,3,4}
Then I want to return either C={1,2}
or C={1,3}
(doesn't matter which).
Upvotes: 7
Views: 1713
Reputation: 60034
I hope I don't misunderstand the problem... Here a solution in SWI-Prolog
:- module(subsets, [solve/0]).
:- [library(pairs),
library(aggregate)].
solve :-
problem(R, K, Subsets),
once(subset_of_maximal_number(R, K, Subsets, Subset)),
writeln(Subset).
problem(4, 2,
[[1,2,3], [1,2,3], [1,2,4], [1,3,4]]).
problem(8, 3,
[[1, 3, 4, 6], [2, 6, 7, 8], [3, 5, 6, 7], [2, 4, 6, 7], [1, 4, 5, 8],
[2, 4, 6, 8], [1, 2, 3, 8], [1, 6, 7, 8], [1, 2, 4, 7], [1, 2, 5, 7]]).
subset_of_maximal_number(R, K, Subsets, Subset) :-
flatten(Subsets, Numbers),
findall(Num-Count,
( between(1, R, Num),
aggregate_all(count, member(Num, Numbers), Count)
), NumToCount),
transpose_pairs(NumToCount, CountToNumSortedR),
reverse(CountToNumSortedR, CountToNumSorted),
length(Subset, K), % list of free vars
prefix(SolutionsK, CountToNumSorted),
pairs_values(SolutionsK, Subset).
test output:
?- solve.
[1,3]
true ;
[7,6,2]
true.
edit: I think that the above solution is wrong, in the sense that what's returned could not be a subset of any of the input: here (a commented) solution without this problem:
:- module(subsets, [solve/0]).
:- [library(pairs),
library(aggregate),
library(ordsets)].
solve :-
problem(R, K, Subsets),
once(subset_of_maximal_number(R, K, Subsets, Subset)),
writeln(Subset).
problem(4, 2,
[[1,2,3], [1,2,3], [1,2,4], [1,3,4]]).
problem(8, 3,
[[1, 3, 4, 6], [2, 6, 7, 8], [3, 5, 6, 7], [2, 4, 6, 7], [1, 4, 5, 8],
[2, 4, 6, 8], [1, 2, 3, 8], [1, 6, 7, 8], [1, 2, 4, 7], [1, 2, 5, 7]]).
subset_of_maximal_number(R, K, Subsets, Subset) :-
flatten(Subsets, Numbers),
findall(Num-Count,
( between(1, R, Num),
aggregate_all(count, member(Num, Numbers), Count)
), NumToCount),
% actually sort by ascending # of occurrences
transpose_pairs(NumToCount, CountToNumSorted),
pairs_values(CountToNumSorted, PreferredRev),
% we need higher values first
reverse(PreferredRev, Preferred),
% empty slots to fill, preferred first
length(SubsetP, K),
select_k(Preferred, SubsetP),
% verify our selection it's an actual subset of any of subsets
sort(SubsetP, Subset),
once((member(S, Subsets), ord_subtract(Subset, S, []))).
select_k(_Subset, []).
select_k(Subset, [E|R]) :-
select(E, Subset, WithoutE),
select_k(WithoutE, R).
test:
?- solve.
[1,3]
true ;
[2,6,7]
true.
Upvotes: 1
Reputation: 1543
I think your problem is NP-Hard. Consider the bipartite graph with the left nodes being your sets and the right nodes being the integers {1, ..., N}
, with an edge between two nodes if the set contains the integer. Then, finding a common subset of size k
, which is a subset of a maximal number of the Si
, is equivalent to finding a complete bipartite subgraph K(i, k)
with maximal number of edges i*k
. If you could do this in polynomial time, then, you could find the complete bipartite subgraph K(i, j)
with maximal number of edges i*j
in polynomial time, by trying for each fixed k
. But this problem in NP-Complete (Complete bipartite graph).
So, unless P=NP, your problem does not have a polynomial time algorithm.
Upvotes: 2
Reputation: 24336
Assuming I understand your question I believe this is straightforward for fairly small sets.
I will use Mathematica code for illustration, but the concept is universal.
I generate 10
random subsets of length 4
, from the set {1 .. 8}:
ss = Subsets[Range@8, {4}] ~RandomSample~ 10
{{1, 3, 4, 6}, {2, 6, 7, 8}, {3, 5, 6, 7}, {2, 4, 6, 7}, {1, 4, 5, 8}, {2, 4, 6, 8}, {1, 2, 3, 8}, {1, 6, 7, 8}, {1, 2, 4, 7}, {1, 2, 5, 7}}
I convert these to a binary array of the presence of each number in each subset:
a = Normal@SparseArray[Join @@ MapIndexed[Tuples[{##}] &, ss] -> 1];
Grid[a]
That is ten columns for ten subsets, and eight rows for elements {1 .. 8}.
Now generate all possible target subsets (size 3
):
keys = Subsets[Union @@ ss, {3}];
Take a "key" and extract those rows from the array and do a BitAnd operation (return 1
iff all columns equal 1
), then count the number of ones. For example, for key {1, 6, 8}
we have:
a[[{1, 6, 8}]]
After BitAnd:
Do this for each key:
counts = Tr[BitAnd @@ a[[#]]] & /@ keys;
Then find the position(s) of the maximum element of that list, and extract the corresponding parts of keys
:
keys ~Extract~ Position[counts, Max@counts]
{{1, 2, 7}, {2, 4, 6}, {2, 4, 7}, {2, 6, 7}, {2, 6, 8}, {6, 7, 8}}
With adequate memory this process works quickly for a larger set. Starting with 50,000 randomly selected subsets of length 7
from {1 .. 30}:
ss = Subsets[Range@30, {7}] ~RandomSample~ 50000;
The maximum sub-subsets of length 4
are calculated in about nine seconds:
AbsoluteTiming[
a = Normal@SparseArray[Join @@ MapIndexed[Tuples[{##}] &, ss] -> 1];
keys = Subsets[Union @@ ss, {4}];
counts = Tr[BitAnd @@ a[[#]]] & /@ keys;
keys~Extract~Position[counts, Max@counts]
]
{8.8205045, {{2, 3, 4, 20}, {7, 10, 15, 18}, {7, 13, 16, 26}, {11, 21, 26, 28}}}
I should add that Mathematica is a high level language and these operations are on generic objects, therefore if this is done truly at the binary level this should be much faster, and more memory efficient.
Upvotes: 2