Reputation: 167
I have a data frame that records each student's attempt at each question (one row per student, one column per question), with '1' representing a correct attempt and '0' representing a wrong attempt:
structure(list(X1 = c(1, 1), X2 = c(0, 0), X3 = c(1, 1), X4 = c(1,
0), X5 = c(1, 1), X6 = c(1, 1), X7 = c(1, 1), X8 = c(0, 0), X9 = c(0,
0), X10 = c(1, 1), X11 = c(1, 1), X12 = c(0, 0), X13 = c(0, 1
), X14 = c(0, 0), X15 = c(0, 0), X16 = c(1, 1), X17 = c(1, 1),
X18 = c(0, 0), X19 = c(1, 1), X20 = c(0, 0), X21 = c(1, 1
), X22 = c(1, 1), X23 = c(1, 1), X24 = c(1, 1), X25 = c(1,
1), X26 = c(1, 1), X27 = c(1, 1), X28 = c(0, 0), X29 = c(1,
1), X30 = c(1, 1), X31 = c(1, 1), X32 = c(0, 0), X33 = c(1,
1)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
))
My interest is in this question: 'given that a student has answered question 1 wrongly, what is the probability that he answers Q2 wrongly too?'. Or more generally, given that he has answered Qi wrongly, what is the probability that he answers Qj wrongly too?
Ideally these conditional probabilities would be stored in a matrix whose (i, j) entry is the probability that he answers question j wrongly given that he answers question i wrongly.
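To make the target concrete, here is one way to write the matrix entry down. The notation is my own assumption: x_{si} is 1 if student s answered question i correctly and 0 otherwise, and the probability is estimated by a simple proportion over students.

```latex
P_{ij} \;=\; \Pr(Q_j \text{ wrong} \mid Q_i \text{ wrong})
\;\approx\; \frac{\sum_{s} (1 - x_{si})\,(1 - x_{sj})}{\sum_{s} (1 - x_{si})}
```

The denominator counts the students who got question i wrong, so the entry is undefined when nobody did.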
My basic idea for the algorithm, for the i-th question, is:
1. Subset all the rows where the i-th entry is 0.
2. Compute the proportion of '0's for each question j in the subsetted matrix.
3. Return the result as a vector.
4. Repeat steps 1-3 for all i, and rbind these vectors into a matrix.
But is there a faster way to achieve what I want?
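For reference, the four steps could be sketched in base R roughly as follows. The toy matrix and the name cond_wrong are made up for illustration and are not part of my real data.

```r
# Hypothetical toy data: 4 students (rows) x 3 questions (columns), 1 = correct
x <- matrix(c(0, 0, 1,
              0, 1, 1,
              1, 0, 0,
              0, 0, 0), nrow = 4, byrow = TRUE)

# cond_wrong is an illustrative helper name, not an existing function
cond_wrong <- function(x, i) {
  wrong_i <- x[x[, i] == 0, , drop = FALSE]  # step 1: rows where question i is wrong
  colMeans(wrong_i == 0)                     # steps 2-3: proportion of 0s per question
}

# step 4: one row per conditioning question i
probs <- t(sapply(seq_len(ncol(x)), function(i) cond_wrong(x, i)))
probs
```

Row i of probs holds the proportions conditional on question i being wrong; a question that nobody answered wrongly would give a row of NaN.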
Upvotes: 0
Views: 41
Reputation: 1392
Your algorithm makes sense; I can't see a better way to do it. Here's an implementation using the dplyr package, which simplifies the checkit function.
library(dplyr)  # provides %>% and filter()

set.seed(34342)
# simulate some data--100 students across 33 questions
x <- data.frame(matrix(sample(c(0, 1), 3300, replace = TRUE), nrow = 100))
# invert x to show incorrect as 1--can then use simple sums
x <- (-x + 1)
checkit <- function(x, n) {
  # keep the students who got question n wrong, then compute the
  # proportion of them who got each question wrong
  return(x %>% filter(., .[, n] == 1) %>% {colSums(.) / nrow(.)})
}
# set up destination matrix (one row and one column per question)
probs <- matrix(0, nrow = ncol(x), ncol = ncol(x))
# fill it line by line
for (i in seq_len(ncol(x))) {
  probs[i, ] <- checkit(x, i)
}
With 10000 simulated students, this ran in an average time of 157 ms on a MacBookAir6,2 (mid-2013).
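If the loop ever becomes a bottleneck, one possible alternative is a single matrix product. This is only a sketch under the same assumptions (a 0/1 matrix with one row per student), and I have not benchmarked it against the version above. The names w, both_wrong, n_wrong and probs_mat are illustrative.

```r
set.seed(34342)
# simulated 0/1 data with the same shape as above; 1 = correct
x <- matrix(sample(c(0, 1), 3300, replace = TRUE), nrow = 100)

w <- 1 - x                          # 1 now marks a wrong answer
both_wrong <- crossprod(w)          # [i, j]: students wrong on both i and j
n_wrong <- diag(both_wrong)         # students wrong on question i
probs_mat <- both_wrong / n_wrong   # recycling divides row i by n_wrong[i]
```

Row i of probs_mat is conditional on question i being wrong, so the diagonal is 1. If a question was never answered wrongly, n_wrong is 0 for it and that row comes out as NaN.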
Upvotes: 1