RobatStats
RobatStats

Reputation: 445

Find subset with similar mean as full set

I have 50 lists, each one filled with 0s ans 1s. I know the overall proportion of 1s when you consider all the 50 lists pooled together. I want to find the 10 lists that pooled together best resemble the overall proportion of 1s.

The function I want to minimise is abs(mean(pooled subset) - mean(pooled full set))

For those who know pandas:

In pandas terms, I have a dataframe as follows

enter image description here

and so on, with a total of 50 labels, each one with a number of values ranging between 100 and 1000. I want to find the subset of 10 labels that minimises d, where d

d = abs(df.loc[df.label.isin(subset), 'Value'].mean() - df.Value.mean())

I tried to apply dynamic programming solutions to the knapsack problem, but the issue is that the contribution of each list (label) to the final sample mean changes depending on which other lists you will include afterwards (because they will increase the sample size in unpredictable ways). It's like having knapsack problem where every new item you pick changes the value of the items you previously picked. Tricky.

Is there a better algorithm to solve this problem?

Upvotes: 3

Views: 279

Answers (1)

Erwin Kalvelagen
Erwin Kalvelagen

Reputation: 16752

There is a way, somewhat cumbersome, to formulate this problem as a MIP (Mixed Integer Programming) problem.

We need the following data:

mu : mean of all data
mu(i) : mean of each subset i
n(i) : number of elements in each subset
N : number of subsets we need to select

And we need some binary decision variables

delta(i) = 1 if subset i is selected and 0 otherwise

A formal statement of the optimization problem can look like:

min | mu - sum(i, mu(i)*n(i)*delta(i)) / sum(i, n(i)*delta(i)) |
subject to 
     sum(i, delta(i)) = N
     delta(i) in {0,1}

Here sum(i, mu(i)*n(i)*delta(i)) is the total value of the selected items and sum(i, n(i)*delta(i)) is the total number of selected items.

The objective is clearly nonlinear (we have an absolute value and a division). This is sometimes called an MINLP problem (MINLP for Mixed Integer Nonlinear Programming). Although MINLP solvers are readily available, we actually can do better. Using some gymnastics we can reformulate this problem into a linear problem (by adding some extra variables and extra inequality constraints). The full details are here. The resulting MIP model can be solved with any MIP solver.

Interestingly we don't need the data values in the model, just n(i),mu(i) for each subset.

Upvotes: 1

Related Questions