Reputation: 631
For the purpose of conducting a psychological experiment I have to divide a set of pictures (240) described by 4 features (real numbers) into 3 subsets with equal number of elements in each subset (240/3 = 80) in such a way that all subsets are approximately balanced with respect to these features (in terms of mean and standard deviation).
Can anybody suggest an algorithm to automate that? Are there any packages/modules in Python or R that I could use to do that? Where should I start?
Upvotes: 6
Views: 11572
Reputation: 1
I was able to solve a similar task by using a generalized version of the Duplex algorithm from Snee, R. D. (1977).
You partition the original dataset into mutually exclusive subsets through a stepwise process of distributing data points from the original dataset. At each iteration, the least filled subset relative to its capacity selects a data point from the remaining pool, opting for the one farthest from the current data points within the subset.
Upvotes: 0
Reputation: 609
Order your items by their decreasing Mahalanobis distance from the mean; they will be ordered from most extraordinary to most boring, including the effects of whatever correlations exist amongst the measures.
Assign X[3*i] X[3*i+1] X[3*i+2] to the subsets A, B, C, choosing for each i the ordering of A/B/C that minimizes your mismatch measure.
Why decreasing order? The statistically heavy items will be assigned first, and the choice of permutation in the larger number of subsequent rounds will have a better chance of evening out initial imbalances.
The point of this procedure is to maximize the chance that whatever outliers exist in the data set will be assigned to separate subsets.
Upvotes: 0
Reputation: 1586
In case you are still interested in the exhaustive search question. You have 240 choose 80 possibilities to choose the first set and then another 160 choose 80 for the second set, at which point the third set is fixed. In total, this gives you:
120554865392512357302183080835497490140793598233424724482217950647 * 92045125813734238026462263037378063990076729140
Clearly, this is not an option :)
Upvotes: 1
Reputation: 55735
You can easily do this using the plyr
library in R. Here is the code.
require(plyr)
# CREATE DUMMY DATA
mydf = data.frame(feature = sample(LETTERS[1:4], 240, replace = TRUE))
# SPLIT BY FEATURE AND DIVIDE INTO THREE SUBSETS EQUALLY
ddply(mydf, .(feature), summarize, sub = sample(1:3, 60, replace = TRUE))
Upvotes: 1
Reputation: 46497
I would tackle this as follows:
Upvotes: 2
Reputation: 31182
If I understand correctly your problem, you might use random.sample()
in python:
import random
pool = set(["foo", "bar", "baz", "123", "456", "789"]) # your 240 elements here
slen = len(pool) / 3 # we need 3 subsets
set1 = set(random.sample(pool, slen)) # 1st random subset
pool -= set1
set2 = set(random.sample(pool, slen)) # 2nd random subset
pool -= set2
set3 = pool # 3rd random subset
Upvotes: 5