Methi

Reputation: 43

Conditional sampling by group based on sample mean

I am trying to use R to generate a number of different trivia quizzes. I have a large dataset (quiz_df) of questions, each with a category and a difficulty, that looks like this:

           ID      Category      Difficulty
1           1      Sports            3         
2           2      Science           7        
3           3      Low culture       4         
4           4      High culture      2         
5           5      Geography         8       
6           6      Lifestyle         3   
7           7      Society           3         
8           8      History           5       
9           9      Sports            2
10         10      Science           8         
...       ...      ...             ...    
1000     1000      Science           3    

Now I want to randomly sample 3 questions from each of the 8 categories so that the mean difficulty of the 24 selected questions is 4 (equivalently, their difficulties sum to 4 * 24 = 96).

library(dplyr)
set.seed(100)

quiz1 <- quiz_df %>% group_by(Category) %>% sample_n(3)

This creates a random quiz set with 3 questions in each category, but it does not take the difficulty into account. I am aware of the weight option in sample_n:

library(dplyr)
set.seed(100)

quiz1 <- quiz_df %>% group_by(Category) %>% sample_n(3, weight = Difficulty)

But this does not solve the issue. Ideally, I would like to add an option like: sum = 96, or mean = 4.

Does anyone have any clues?

Upvotes: 1

Views: 220

Answers (1)

riccardo-df

Reputation: 552

This is a brute-force solution:

library(dplyr)

# Generating sample data.
set.seed(1986)

n = 1000
quiz_df = data.frame(id = 1:n, 
                     Category = sample(c("Sports", "Science", "Society"), size = n, replace = TRUE), 
                     Difficulty = sample(1:10, size = n , replace = TRUE))


# Solution: resample until condition is met.
repeat {
  # slice_sample() supersedes sample_n() (see the dplyr documentation).
  temp.draw = quiz_df %>% group_by(Category) %>% slice_sample(n = 3)

  # Check whether the draw satisfies the condition (mean Difficulty of 4).
  # Comparing integer sums avoids relying on exact floating-point equality.
  if (sum(temp.draw$Difficulty) == 4 * nrow(temp.draw))
  {
    final.draw = temp.draw
    break
  }
}

final.draw
mean(final.draw$Difficulty)

First of all, as you are new to SO, let me tell you that you should always include some example data in your questions - not just the structure, but something we can run on our machines. For this answer I simulated some data with only three categories. My solution runs in less than two seconds on the simulated data, but it may need more time on your full data set.

The idea is simply to resample until we get a draw, three questions per category, whose mean Difficulty equals 4. That is 24 questions with your eight categories and 9 with my three simulated ones. Clearly, this is not an elegant solution, but it may be a first step.
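The same resampling idea can be wrapped in a function with a cap on the number of attempts, so the loop cannot run forever if no draw ever meets the target. This is only a minimal sketch under stated assumptions: the function name draw_quiz and its parameters n_per_cat, target_mean and max_tries are made up for illustration, and every category is assumed to have at least n_per_cat questions.

```r
library(dplyr)

# Hypothetical helper (names are illustrative): resample until the total
# difficulty matches the target, giving up after max_tries attempts.
draw_quiz <- function(df, n_per_cat = 3, target_mean = 4, max_tries = 10000) {
  n_total <- n_per_cat * n_distinct(df$Category)
  target_sum <- target_mean * n_total  # compare sums, not floating-point means

  for (i in seq_len(max_tries)) {
    draw <- df %>%
      group_by(Category) %>%
      slice_sample(n = n_per_cat) %>%
      ungroup()
    if (sum(draw$Difficulty) == target_sum) return(draw)
  }
  stop("No draw with the requested mean found in ", max_tries, " tries.")
}

# Simulated data, same shape as in the answer above.
set.seed(1986)
n <- 1000
quiz_df <- data.frame(id = 1:n,
                      Category = sample(c("Sports", "Science", "Society"), size = n, replace = TRUE),
                      Difficulty = sample(1:10, size = n, replace = TRUE))

quiz <- draw_quiz(quiz_df)
mean(quiz$Difficulty)
```

Raising max_tries trades running time for a better chance of success when matching draws are rare.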

I am going to try again to find a better solution. I guess the problem is that the draws are not independent; I will look deeper into that.
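One possible direction, offered here as an assumption and not as the follow-up promised above, is to start from a random draw and then swap single questions within their category until the total difficulty hits the target. The sketch below uses base R only. The name repair_quiz and its parameters are hypothetical, and unlike pure resampling this approach is not guaranteed to pick every valid quiz with equal probability.

```r
# Hypothetical helper (not from the original answer): random start, then
# within-category swaps that move the total difficulty closer to the target.
repair_quiz <- function(df, n_per_cat = 3, target_mean = 4, max_steps = 1000) {
  rows_by_cat <- split(seq_len(nrow(df)), df$Category)
  target_sum <- target_mean * n_per_cat * length(rows_by_cat)

  # Random starting draw: n_per_cat row indices from each category.
  picked <- unlist(lapply(rows_by_cat, function(r) r[sample.int(length(r), n_per_cat)]),
                   use.names = FALSE)

  for (step in seq_len(max_steps)) {
    gap <- target_sum - sum(df$Difficulty[picked])
    if (gap == 0) return(df[picked, ])

    # Choose one selected question at random and look for replacements
    # in the same category that are not already selected.
    pos <- sample.int(length(picked), 1)
    old <- picked[pos]
    pool <- setdiff(rows_by_cat[[as.character(df$Category[old])]], picked)
    change <- df$Difficulty[pool] - df$Difficulty[old]

    # Keep only swaps that bring the total strictly closer to the target.
    better <- pool[abs(gap - change) < abs(gap)]
    if (length(better) > 0) picked[pos] <- better[sample.int(length(better), 1)]
  }
  stop("Target not reached within ", max_steps, " steps.")
}

# Simulated data, same shape as in the answer above.
set.seed(1986)
n <- 1000
quiz_df <- data.frame(id = 1:n,
                      Category = sample(c("Sports", "Science", "Society"), size = n, replace = TRUE),
                      Difficulty = sample(1:10, size = n, replace = TRUE))

quiz <- repair_quiz(quiz_df)
mean(quiz$Difficulty)
```

Because every accepted swap shrinks the gap to the target, this usually needs far fewer iterations than redrawing the whole quiz each time.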

P.S. From the documentation I see that sample_n() has been superseded by slice_sample(), so I suggest you rely on the latter.
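For reference, the two sample_n() calls from the question could be written with slice_sample() roughly as follows. This is a small sketch on made-up toy data (the toy_df object is an assumption for illustration); weight_by is the slice_sample() argument that plays the role of weight in sample_n().

```r
library(dplyr)

# Toy data for illustration only: 10 questions in each of three categories.
set.seed(100)
toy_df <- data.frame(ID = 1:30,
                     Category = rep(c("Sports", "Science", "Society"), each = 10),
                     Difficulty = sample(1:10, size = 30, replace = TRUE))

# Equivalent of sample_n(3): three random questions per category.
unweighted <- toy_df %>% group_by(Category) %>% slice_sample(n = 3)

# Equivalent of sample_n(3, weight = Difficulty): harder questions are
# more likely to be picked, but the mean difficulty is not constrained.
weighted <- toy_df %>% group_by(Category) %>% slice_sample(n = 3, weight_by = Difficulty)
```

As in the question, weighting only shifts the selection probabilities, so it cannot by itself guarantee a mean of 4.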

Upvotes: 1
