twowo

Reputation: 631

Divide set into subsets with equal number of elements

For the purpose of conducting a psychological experiment I have to divide a set of 240 pictures, each described by 4 features (real numbers), into 3 subsets of equal size (240/3 = 80) in such a way that all subsets are approximately balanced with respect to these features (in terms of mean and standard deviation).

Can anybody suggest an algorithm to automate that? Are there any packages/modules in Python or R that I could use to do that? Where should I start?

Upvotes: 6

Views: 11572

Answers (6)

I was able to solve a similar task by using a generalized version of the Duplex algorithm from Snee, R. D. (1977).

You partition the original dataset into mutually exclusive subsets through a stepwise process of distributing data points. At each iteration, the subset that is least filled relative to its capacity selects a data point from the remaining pool, choosing the one farthest from the points it already contains.
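As an illustration, here is a minimal Python sketch of that stepwise idea. It is not Snee's exact Duplex algorithm: the maximin rule (farthest by minimum Euclidean distance to the subset's current members) and the seeding of empty subsets are assumptions.

```python
import numpy as np

def stepwise_partition(X, n_subsets=3):
    """Greedy stepwise split: the subset that is least filled relative to
    its capacity claims the remaining point farthest (maximin Euclidean
    distance, an assumption) from the points it already holds."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # capacities; equal when n is divisible by n_subsets (240/3 = 80)
    caps = [n // n_subsets + (1 if i < n % n_subsets else 0)
            for i in range(n_subsets)]
    remaining = list(range(n))
    subsets = [[] for _ in range(n_subsets)]
    while remaining:
        # least-filled subset (relative to capacity) that still has room
        k = min((i for i in range(n_subsets) if len(subsets[i]) < caps[i]),
                key=lambda i: len(subsets[i]) / caps[i])
        if not subsets[k]:          # empty subset: take any remaining point
            j = remaining[0]
        else:                       # farthest remaining point from subset k
            held = X[subsets[k]]
            j = max(remaining,
                    key=lambda r: np.linalg.norm(held - X[r], axis=1).min())
        subsets[k].append(j)
        remaining.remove(j)
    return subsets
```

For 240 points this runs in well under a second, and the result can be checked afterwards by comparing per-subset feature means and standard deviations.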

Upvotes: 0

phunctor

Reputation: 609

Order your items by their decreasing Mahalanobis distance from the mean; they will be ordered from most extraordinary to most boring, including the effects of whatever correlations exist amongst the measures.

Assign X[3*i], X[3*i+1], X[3*i+2] to the subsets A, B, C, choosing for each i the permutation of A/B/C that minimizes your mismatch measure.

Why decreasing order? The statistically heavy items will be assigned first, and the choice of permutation in the larger number of subsequent rounds will have a better chance of evening out initial imbalances.

The point of this procedure is to maximize the chance that whatever outliers exist in the data set will be assigned to separate subsets.
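A rough Python sketch of this procedure, using NumPy; the mismatch measure (variance of the running per-subset feature means) is an assumption, since the answer leaves it up to you:

```python
import itertools
import numpy as np

def _mismatch(X, subsets, triple, perm):
    # assumed measure: variance across the three subsets' feature means
    # after tentatively assigning triple[i] to subset perm[i]
    means = []
    for s in range(3):
        idx = list(subsets[s]) + [p for p, q in zip(triple, perm) if q == s]
        means.append(X[idx].mean(axis=0))
    return np.var(np.stack(means), axis=0).sum()

def mahalanobis_order_split(X):
    """Order points by decreasing Mahalanobis distance from the mean,
    then place each consecutive triple into subsets A/B/C in whichever
    of the 6 orders best evens out the running feature means."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d = np.einsum('ij,jk,ik->i', X - mu, cov_inv, X - mu)  # squared distances
    order = np.argsort(-d)          # most extraordinary points first
    subsets = [[], [], []]
    for t in range(0, len(order) - len(order) % 3, 3):
        triple = order[t:t + 3]
        best = min(itertools.permutations(range(3)),
                   key=lambda perm: _mismatch(X, subsets, triple, perm))
        for point, s in zip(triple, best):
            subsets[s].append(int(point))
    return subsets
```

Because the heavy outliers are placed first, the many later triples get a chance to compensate for whatever imbalance they created.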

Upvotes: 0

Dino

Reputation: 1586

In case you are still interested in exhaustive search: there are 240 choose 80 ways to choose the first set and then another 160 choose 80 for the second, at which point the third set is fixed. In total, this gives you:

120554865392512357302183080835497490140793598233424724482217950647 * 92045125813734238026462263037378063990076729140

Clearly, this is not an option :)
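The count is easy to reproduce with Python's math.comb:

```python
import math

first = math.comb(240, 80)   # ways to pick the first subset of 80
second = math.comb(160, 80)  # ways to pick the second; the third is fixed
total = first * second
print(total)                 # a number of more than 100 digits
```

Even at a billion candidate partitions per second, enumerating a number of this size is hopeless.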

Upvotes: 1

Ramnath

Reputation: 55735

You can easily do this using the plyr library in R. Here is the code.

require(plyr)

# CREATE DUMMY DATA
mydf = data.frame(feature = sample(LETTERS[1:4], 240, replace = TRUE))

# SPLIT BY FEATURE AND ASSIGN EACH GROUP EVENLY ACROSS THREE SUBSETS
mydf = ddply(mydf, .(feature), transform,
             sub = sample(rep(1:3, length.out = length(feature))))

Upvotes: 1

btilly

Reputation: 46497

I would tackle this as follows:

  1. Divide into 3 equal subsets.
  2. Figure out the mean and variance of each subset. From them construct an "unevenness" measure.
  3. Compare each pair of elements; if swapping them would reduce the "unevenness", swap them. Continue until either no swap improves things, or the total unevenness falls below some arbitrary "good enough" threshold.
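The three steps above can be sketched in Python as follows; the concrete "unevenness" measure (squared deviation of each subset's per-feature mean and std from the whole set's) is an assumption:

```python
import numpy as np

def unevenness(X, labels):
    """Assumed measure: how far each subset's per-feature mean and std
    drift from the full dataset's mean and std."""
    total = 0.0
    for s in np.unique(labels):
        sub = X[labels == s]
        total += ((sub.mean(axis=0) - X.mean(axis=0)) ** 2).sum()
        total += ((sub.std(axis=0) - X.std(axis=0)) ** 2).sum()
    return total

def balance_by_swapping(X, n_subsets=3, sweeps=10):
    X = np.asarray(X, dtype=float)
    labels = np.arange(len(X)) % n_subsets   # step 1: equal-sized split
    score = unevenness(X, labels)            # step 2: initial measure
    for _ in range(sweeps):                  # step 3: greedy pairwise swaps
        improved = False
        for i in range(len(X)):
            for j in range(i + 1, len(X)):
                if labels[i] == labels[j]:
                    continue
                labels[i], labels[j] = labels[j], labels[i]
                new = unevenness(X, labels)
                if new < score:
                    score, improved = new, True
                else:                        # undo a non-improving swap
                    labels[i], labels[j] = labels[j], labels[i]
        if not improved:                     # local optimum reached
            break
    return labels, score
```

Swapping preserves the subset sizes, so the 80/80/80 constraint holds by construction; each sweep is O(n^2) evaluations, which is fine for n = 240.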

Upvotes: 2

Michał Šrajer

Reputation: 31182

If I understand your problem correctly, you might use random.sample() in Python:

import random

pool = set(["foo", "bar", "baz", "123", "456", "789"]) # your 240 elements here
slen = len(pool) // 3                          # we need 3 equal subsets
set1 = set(random.sample(sorted(pool), slen))  # 1st random subset
pool -= set1
set2 = set(random.sample(sorted(pool), slen))  # 2nd random subset
pool -= set2
set3 = pool                                    # the remainder is the 3rd subset

(In Python 3, use integer division for slen, and pass a sequence to random.sample(): sampling directly from a set was deprecated in 3.9 and removed in 3.11. Note this split is purely random; it does not by itself check that the features are balanced across subsets.)

Upvotes: 5
