Reputation: 43
I am trying to use R to make a bunch of different trivia quizzes. I have a large dataset (quiz_df) containing numerous questions divided into categories and difficulties looking like this:
ID Category Difficulty
1 1 Sports 3
2 2 Science 7
3 3 Low culture 4
4 4 High culture 2
5 5 Geography 8
6 6 Lifestyle 3
7 7 Society 3
8 8 History 5
9 9 Sports 2
10 10 Science 8
... ... ... ...
1000 1000 Science 3
Now I want to randomly sample 3 questions from each of the 8 categories, so that the mean difficulty is 4 (or the sum being 4*24 = 96).
library(dplyr)
set.seed(100)
quiz1 <- quiz_df %>% group_by(Category) %>% sample_n(3)
This creates a random quiz set with 3 questions in each category, but does not take into consideration the difficulty. I am aware of the weight-option in sample_n:
library(dplyr)
set.seed(100)
quiz1 <- quiz_df %>% group_by(Category) %>% sample_n(3, weight = Diffculty)
But this does not solve the issue. Ideally, I would like to add an option like: sum = 96, or mean = 4.
Does anyone have any clues?
Upvotes: 1
Views: 220
Reputation: 552
This is a brute-force solution:
library(dplyr)
# Generating sample data.
set.seed(1986)
n = 1000
quiz_df = data.frame(id = 1:n,
Category = sample(c("Sports", "Science", "Society"), size = n, replace = TRUE),
Difficulty = sample(1:10, size = n , replace = TRUE))
# Solution: resample until condition is met.
repeat {
temp.draw = quiz_df %>% group_by(Category) %>% slice_sample(n = 3) # From documentation, sample_n() is outdated!
temp.mean = mean(temp.draw$Difficulty)
if (temp.mean == 4) # Check if the draw satisfies your condition.
{
final.draw = temp.draw
break
}
}
final.draw
mean(final.draw$Difficulty)
First of all, as you are new to SO, let me tell you that you should always include some example data in your questions - not just the structure, but something we can run on our machine. Anyway, for this time I just simulated some data, including three instances of Category
only. My solution runs in less than two seconds, however with the whole data set the code may need more time.
The idea is to just resample until we get 24 questions, three for each category, whose mean Difficulty
equals 4. Clearly, this is not an elegant solution, but it may be a first step.
I am going to try again to find a better solution. I guess the problem is that the draws are not independent, I will look deeper into that.
Ps, from the documentation I see that sample_n()
has been superseeded by slice_sample()
, so I suggest you to rely on the latter.
Upvotes: 1