Combinatorics: randomise without repetition of elements, am I trying to do something impossible?

Question

Apologies for the phrasing of the title, and apologies if the answer is terribly obvious. My quantitative background is not strong and I may be asking a stupid question.

I have a set of 24 items which we can imagine as pictures, and 24 labels for those pictures. This means I have 552 possible picture-label pairs.

I want to collect 10 ratings for each of these picture-label pairs, so 5520 ratings in total, and I want to collect them from 460 participants giving 12 ratings each.

My problem arises when I try to generate input files (select 12 picture-label pairs) without repetition. I can do this without repetition of pairs, but I also don't want any picture to appear twice, nor any label to appear twice, in any given participant's input.

I have tried to do this by starting with a dataframe with 5520 rows containing all the picture-label pairs I want to collect ratings for. I then sample 12 rows from this dataframe repeatedly until I find a sample that doesn't contain any repetitions, remove those rows from the dataframe, and continue. Eventually I get stuck in an infinite while loop, because I reach a point where it is no longer possible to draw a sample without repetitions from the remaining rows.

Is this because my approach is wrong, or am I trying to do something impossible?


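library(gtools)  # provides permutations()

pairs <- as.data.frame(permutations(n = 24, r = 2, v = 1:24, repeats.allowed = FALSE))
nrow(pairs)

# assumed: shuffled_pairs is a row-shuffled copy of pairs (its definition was not shown)
shuffled_pairs <- pairs[sample(nrow(pairs)), ]

for (i in seq(1, to = 552, by = 12)) {

  # get sample
  s <- sample(nrow(shuffled_pairs), 12)
  d <- shuffled_pairs[s, ]

  # check for repetitions of either V1 (pic) or V2 (label);
  # this is the loop that eventually never terminates
  while (length(unique(d$V1)) < 12 || length(unique(d$V2)) < 12) {
    s <- sample(nrow(shuffled_pairs), 12)
    d <- shuffled_pairs[s, ]
  }

  # remove the accepted rows so they cannot be drawn again
  shuffled_pairs <- shuffled_pairs[-s, ]

}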

Allan Cameron · Accepted Answer

The answer is that this is not possible with 460 raters: you need 480 raters doing 12 ratings each to cover the 10 * 24 * 24, or 5760, samples you need. However, with this caveat, it is possible to get all the samples you want within the desired constraints. The code itself is pretty short:

mod24 <- function(x) (x + 0:11 - 1) %% 24 + 1

df <- data.frame(picture = rep(c(rep(1:12, 24), rep(13:24, 24)), 10),
                 label = rep(do.call("c", lapply(1:24, mod24)), 20),
                 rater   = rep(c(rep(1:48, each = 12)), 10) + rep(0:9 * 48, each = 576))

However, this requires quite a bit of explanation.

You can simplify the problem by splitting your 480 people into ten groups of 48 people, with each group doing the same thing: between them, rating each picture/label combination exactly once, with exactly 12 ratings each. So you can focus on how one group of 48 people would perform this task to cover all 576 possibilities exactly once.

Another thing to note is that since everyone has to pick 12 paintings, you can further simplify by splitting the 48-member group into two groups of 24 people who get either the first twelve or second twelve paintings. That way, you are guaranteed not to have any repeated paintings per rater.

Now all you need to do is ensure that every label is given to every painting exactly once. You can do this by giving the first participant's paintings labels 1:12, then the second participant's paintings labels 2:13 etc, until you get to 13:24, after which the labels become c(14:24, 1), then c(15:24, 1:2) etc. This ensures that in the group with paintings 1:12, each painting gets each label assigned once and only once. Now do the same for paintings 13:24. You will have 48 people with 12 ratings each, covering all possible combinations once.
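
As a minimal sketch of this rotation for the first half-group only (pictures 1:12, 24 raters), the following lays the labels out as a matrix and checks both properties. The name `labels_by_rater` is illustrative and does not appear in the answer's code.

```r
# Same helper as in the answer: 12 consecutive labels starting at x, wrapping after 24
mod24 <- function(x) (x + 0:11 - 1) %% 24 + 1

# One row per rater (24 raters), one column per picture (pictures 1:12)
labels_by_rater <- t(sapply(1:24, mod24))

labels_by_rater[1, ]   # rater 1 gets labels 1:12
labels_by_rater[14, ]  # rater 14 gets c(14:24, 1)

# No rater sees the same label twice
no_label_repeats <- all(apply(labels_by_rater, 1, function(r) length(unique(r))) == 12)

# Every picture is paired with each of the 24 labels exactly once
full_coverage <- all(apply(labels_by_rater, 2, function(p) length(unique(p))) == 24)

no_label_repeats
full_coverage
```

The same matrix applies unchanged to pictures 13:24, which is why the full design needs 48 raters per group.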

Do the same for each group of 48 people, and you will have 10 ratings per unique picture / label pair, and each rater will have given 12 ratings, and no rater will have rated the same painting or label twice.

Going back to our code, we can see df contains 5760 samples:

nrow(df)
#> [1] 5760

It has 576 unique combinations of picture and label, each repeated 10 times:

table(df$picture, df$label)
#>     
#>       1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
#>   1  10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
#>   2  10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
#>   3  10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
#>   4  10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
#>   5  10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
#>   6  10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
#>   7  10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
#>   8  10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
#>   9  10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
#>   10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
#>   11 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
#>   12 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
#>   13 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
#>   14 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
#>   15 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
#>   16 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
#>   17 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
#>   18 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
#>   19 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
#>   20 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
#>   21 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
#>   22 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
#>   23 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
#>   24 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10

and each of the 480 raters has 12 pairs to rate:

table(table(df$rater))

#>  12 
#> 480

All of which are unique:

table(sapply(split(df, df$rater), function(x) nrow(unique(x))))

#>  12 
#> 480
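
The check above confirms that no rater has a duplicated row. A slightly stricter check, sketched here on a rebuilt copy of `df`, counts distinct pictures and distinct labels per rater separately, which is the constraint the question asks for. The helper name `n_distinct_values` is illustrative.

```r
# Rebuild the design so this snippet runs on its own
mod24 <- function(x) (x + 0:11 - 1) %% 24 + 1
df <- data.frame(picture = rep(c(rep(1:12, 24), rep(13:24, 24)), 10),
                 label   = rep(do.call("c", lapply(1:24, mod24)), 20),
                 rater   = rep(rep(1:48, each = 12), 10) + rep(0:9 * 48, each = 576))

# Number of distinct values in a vector
n_distinct_values <- function(v) length(unique(v))

# Distinct pictures and distinct labels seen by each rater
pics_per_rater   <- tapply(df$picture, df$rater, n_distinct_values)
labels_per_rater <- tapply(df$label,   df$rater, n_distinct_values)

# Both should be 12 for every one of the 480 raters
all(pics_per_rater == 12)
all(labels_per_rater == 12)
```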

EDIT

The OP is concerned that the constant co-occurrence of groups of pictures may introduce a bias. The way round this is to pair the first person in the picture 1:12 group and the first person in the 13:24 group, and allow them to randomly trade some of their allocations. Their pictures cannot become duplicates because there is no overlap in the pictures they have, and their labels cannot become duplicated because they always trade the same labels:

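# Within each block of 48 raters, randomly flag rows of the first 24 raters for a trade
swaps <- do.call(c, lapply(1:10, function(x) c(rbinom(24 * 12, 1, 0.5), rep(0, 24 * 12))))
idx <- which(swaps == 1)
# Exchange only the picture: the partner row 24 * 12 positions later carries the same label,
# whereas swapping whole rows (rater included) would merely reorder the data frame
swap_out <- df$picture[idx]
df$picture[idx] <- df$picture[idx + 24 * 12]
df$picture[idx + 24 * 12] <- swap_out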

This new data frame still meets all of the specifications.
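
One way to check that claim is a small helper (here called `meets_spec`, a hypothetical name) that tests the requirements on any data frame with `picture`, `label` and `rater` columns. The sketch below rebuilds the design, trades pictures between paired raters by exchanging only the `picture` column (an assumption about how the trade is carried out), and re-runs the checks. The seed is arbitrary.

```r
# Rebuild the design so this snippet runs on its own
mod24 <- function(x) (x + 0:11 - 1) %% 24 + 1
df <- data.frame(picture = rep(c(rep(1:12, 24), rep(13:24, 24)), 10),
                 label   = rep(do.call("c", lapply(1:24, mod24)), 20),
                 rater   = rep(rep(1:48, each = 12), 10) + rep(0:9 * 48, each = 576))

# TRUE if every picture/label pair has 10 ratings, every rater has 12 rows,
# and no rater sees any picture or any label more than once
meets_spec <- function(d) {
  counts <- table(d$picture, d$label)
  per_rater <- function(v) tapply(v, d$rater, function(z) length(unique(z)))
  all(dim(counts) == 24) && all(counts == 10) &&
    all(table(d$rater) == 12) &&
    all(per_rater(d$picture) == 12) &&
    all(per_rater(d$label) == 12)
}

before <- meets_spec(df)

set.seed(1)
half <- 24 * 12
swaps <- unlist(lapply(1:10, function(x) c(rbinom(half, 1, 0.5), rep(0, half))))
idx <- which(swaps == 1)

# Trade pictures between each flagged row and its partner row (same label, other half-group)
tmp <- df$picture[idx]
df$picture[idx] <- df$picture[idx + half]
df$picture[idx + half] <- tmp

after <- meets_spec(df)
c(before = before, after = after)

# Raters from the first half-group now also hold pictures from 13:24
any(df$picture[df$rater <= 24] > 12)
```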
