Reputation: 33
I'm writing for an advice about a kNN implementation I made.
I'm studying a dataset representing a multiple choice question exam and my variable are an identifier of the student, the course, the year in which they took the course, a string identifier for each question of each exam, whether they answered or not plus the score on the correspondent question.
Since some student didn't answer some question, I want to impute using kNN with k=5 what they would have answered. The rationale I used is the following: for each year and course I filter my dataset to have students which did leave at least one answer (all of them for that course and year essentially). Starting with a student, I look at all their unanswered questions. For each of them I select all the other students that did answer that question instead and I compute the distance between the two students (before passing to the successive student), sorting the distances and taking the mean of the score of the other five. The distance I introduced was one highlighting how far the colleagues are in terms of answers: i.e the distances is 0 if they have the same score on the same answer, $1$ otherwise. In this way when I look at the sum of this 0 and 1 on all the questions and take the smallest 5, should represent(most likely) students similar to me in terms of behavior.
Feel free to comment or suggest a better way to tackle the problem. The dataset has ~400000 rows (i.e questions, with repetition for students) representing 5 years for 3 courses and ~6000 unique students and 6 columns, with a rate of abstentions of 20%. Is this considered a large dataset? The problem is that the following implementation takes more than 12h to finish and I wonder what could be improved to reduce the time to minutes if possible, or at least few hours. I think that the slow part is computing each time the distance between pair of students, but even if I use a list to save some score instead of computing it twice, I don't how much time I could spare. I tried to precompute the distances using a matrix, but that doesn't seems to work either. All the data is in a data.table format, which I red to be better in terms of speed.
Any help or improvement would be appreciated, I saw there is also knn function in R but didn't know whether I could pass my distance function as an argument.
compute_distance <- function(student_a, student_b, data_course, current_question) {
# Get scores for all other questions except the current one
scores_a <- data_course[student_id == student_a & question_id != current_question, SCORE]
scores_b <- data_course[student_id == student_b & question_id != current_question, SCORE]
# Ensure both score sets have the same questions
if (length(scores_a) == length(scores_b)) {
# Binary distance (0 if same score, 1 if different)
dist <- sum(scores_a != scores_b)
return(dist)
} else {
return(Inf) # If questions don't match, return a large distance
}
}
knn_impute <- function(data, k = 5) {
# Start the entire process timer
total_start_time <- Sys.time()
# Iterate over each unique year
for (year_x in unique(data$year)) {
# Track the time for each year
year_start_time <- Sys.time()
# Print current year
cat("Processing year:", year_x, "\n")
# Filter data for this year
data_year <- data[year == year_x]
# Iterate over each unique course within the year
for (course_x in unique(data_year$course)) {
# Track the time for each course
course_start_time <- Sys.time()
# Print current course
cat(" Processing course:", course_x, "\n")
# Filter data for this course within the year
data_course <- data_year[course == course_x]
# Get unique unanswered students for this year and course
unanswered_students <- unique(data_course[ans == 0, student_id])
# Iterate over each student who has unanswered questions
for (student_x in unanswered_students) {
# Get all unanswered questions for student_x
unanswered_questions <- data_course[student_id == student_x & ans == 0, question_id]
# Iterate over each unanswered question for student_x
for (question_x in unanswered_questions) {
# Get all students who answered the current question
answered_students <- data_course[question_id == question_x & ans != 0, student_id]
# If no one answered the question, skip it
if (length(answered_students) == 0) next
# Calculate distances and store directly in a vector for sorting
distances <- sapply(answered_students, function(student_y) {
compute_distance(student_x, student_y, data_course, question_x)
})
# Sort distances and select the k nearest neighbors
nearest_neighbors <- answered_students[order(distances)[1:min(k, length(distances))]]
# Impute the score for student_x's unanswered question
imputed_score <- mean(data_course[student_id %in% nearest_neighbors & question_id == question_x, score], na.rm = TRUE)
# Update the score for student_x in the data
data[student_id == student_x & question_id == question_x, score := imputed_score]
}
}
# Calculate and print course duration
course_end_time <- Sys.time()
course_duration <- as.numeric(difftime(course_end_time, course_start_time, units = "secs"))
cat(" Time spent on course:", course_x, "=", course_duration, "seconds\n")
}
# Calculate and print year duration
year_end_time <- Sys.time()
year_duration <- as.numeric(difftime(year_end_time, year_start_time, units = "secs"))
cat("Time spent on year:", year_x, "=", year_duration, "seconds\n\n")
}
# Print total time spent
total_end_time <- Sys.time()
total_duration <- as.numeric(difftime(total_end_time, total_start_time, units = "secs"))
cat("Total time spent on kNN imputation =", total_duration, "seconds\n")
return(data)
}
# Call your function and run the imputation
imputed_data <- knn_impute(filtered_data, k = 5)
example:
student_id | question_id | course | year | ans | score |
---|---|---|---|---|---|
1 | 10 | Course_A | 2019 | 1 | 3 |
2 | 15 | Course_B | 2020 | 0 | 0 |
3 | 20 | Course_A | 2021 | 1 | 5 |
4 | 25 | Course_C | 2019 | 1 | 4 |
5 | 30 | Course_A | 2020 | 0 | 0 |
The full CSV file with 10000 rows created with chat gpt working as example can be downloaded from this link.
Upvotes: 1
Views: 257
Reputation: 102529
I tried my best to understand your logic and here is a refactored version, hope it could be faster
f <- function(d, k = 5) {
lst <- split(d, ~ans)
unanswered <- lst$`0`
answered <- lst$`1`
student_x <- unique(unanswered$student_id)
question_x <- unique(unanswered$question_id)
student_y <- with(answered, unique(student_id[question_id %in% question_x]))
for (q in question_x) {
u <- subset(d, question_id != q)
for (sa in student_x) {
score_a <- subset(u, student_id == sa)$score
distances <- vector(mode = "numeric", length = length(student_y))
for (sb_idx in seq_along(student_y)) {
score_b <- subset(u, student_id == student_y[sb_idx])$score
if (!length(score_b)) next
distances[sb_idx] <- ifelse(length(score_a) != length(score_b), Inf, sum(score_a != score_b))
}
nearest_neighbors <- student_y[order(distances)[1:min(k, length(distances))]]
imputed_score <- mean(subset(u, student_id %in% nearest_neighbors & question_id == q)$score, na.rm = TRUE)
d[with(d, student_id == sa & question_id == q), "score"] <- imputed_score
}
}
d
}
out <- do.call(
rbind,
lapply(
unname(split(df, ~ year + course)),
f
)
)
out <- out[order(as.integer(row.names(out))), ]
where the big dataset is split into smaller chunks by year + course
combos, and then apply the "KNN" to each subgroup. The gain is that, we don't need to index from the big dataframe, but just from the related subset, and thus it might be a bit faster.
Upvotes: 0
Reputation: 2816
I tried one approach, but with the test dataset provided, it actually turned out to be slower than your original solution. I suspect that the speed of the algorithm depends heavily on the distribution of missingness (for example, are non-answers clustered among specific questions/students, or are they randomly distributed?). Even so, I'm not optimistic that this version will be any faster than yours. But I'm posting it here just in case.
The important change here is that instead of computing distance between students who answered exactly the same set of questions, we instead compute distance between all pairs of students. If one student answered a question and the other didn't, we score that question as "different" for those students. If both students didn't answer a particular question, we score that question as "same" for those students. This isn't quite the same as your original metric, but it seems to me that it's similar in spirit. It also has two advantages:
The function below implements this change. I've eliminated the compute_distance()
function and just folded the necessary code into the main function.
For efficiency, this function doesn't actually calculate distances between all pairs of students. Instead, for each student who skipped at least one question, it calculates the distance between that student and all other students who answered that question. The distance score will be affected by the fact that student x didn't answer the question and every student y did; however, since the distance score for each student y will be affected in the same way, this should not be a problem.
I made two other changes to the code:
data.table
s, the :=
operator assigns values by reference; this means that the function actually changes the original data
argument. I've changed it so that it updates a copy of data
instead so as to leave the original table unchanged for comparison.knn_impute <- function(data, k = 5) {
# Start the entire process timer
total_start_time <- Sys.time()
# Copy the data so we don't overwrite the original table
data_copy = copy(data)
# Iterate over each unique year
for (year_x in unique(data$year)) {
# Track the time for each year
year_start_time <- Sys.time()
# Print current year
cat("Processing year:", year_x, "\n")
# Filter data for this year
data_year <- data[year == year_x]
# Iterate over each unique course within the year
for (course_x in unique(data_year$course)) {
# Track the time for each course
course_start_time <- Sys.time()
# Print current course
cat(" Processing course:", course_x, "\n")
# Filter data for this course within the year
data_course <- data_year[course == course_x]
# Get unique unanswered students for this year and course
unanswered_students <- unique(data_course[ans == 0, student_id])
# Iterate over each student who has unanswered questions
for (student_x in unanswered_students) {
# Get all unanswered questions for student_x
unanswered_questions <- data_course[student_id == student_x & ans == 0, question_id]
# Get all other students who answered at least one question that the
# current student didn't answer
other_students <- sort(unique(data_course[student_id != student_x & question_id %in% unanswered_questions & ans != 0, student_id]))
# For each other student who answered at least one question that the
# current student didn't answer, get the distance between that student
# and the current student
scores_a <- data_course[student_id == student_x, score]
scores_a <- ifelse(data_course[student_id == student_x, ans] == 0, -1, scores_a)
distances <- sapply(other_students, function(student_y) {
# Get the vector of answers for each student
scores_b <- data_course[student_id == student_y, score]
# Use -1 for unanswered questions
scores_b <- ifelse(data_course[student_id == student_y, ans] == 0, -1, scores_b)
# Return the number of mismatches between the two students
return(sum(scores_a != scores_b))
})
# Iterate over each unanswered question for student_x
for (question_x in unanswered_questions) {
# Get all students who answered the current question
answered_students <- sort(unique(data_course[question_id == question_x & ans != 0, student_id]))
# If no one answered the question, skip it
if (length(answered_students) == 0) next
# Get distances for students who answered the question
q_distances <- distances[other_students %in% answered_students]
# Sort distances and select the k nearest neighbors
nearest_neighbors <- answered_students[order(q_distances)[1:min(k, length(q_distances))]]
# Impute the score for student_x's unanswered question
imputed_score <- mean(data_course[student_id %in% nearest_neighbors & question_id == question_x, score], na.rm = TRUE)
# Update the score for student_x in the data
data_copy[year == year_x & course == course_x & student_id == student_x & question_id == question_x, score := imputed_score]
}
}
# Calculate and print course duration
course_end_time <- Sys.time()
course_duration <- as.numeric(difftime(course_end_time, course_start_time, units = "secs"))
cat(" Time spent on course:", course_x, "=", course_duration, "seconds\n")
}
# Calculate and print year duration
year_end_time <- Sys.time()
year_duration <- as.numeric(difftime(year_end_time, year_start_time, units = "secs"))
cat("Time spent on year:", year_x, "=", year_duration, "seconds\n\n")
}
# Print total time spent
total_end_time <- Sys.time()
total_duration <- as.numeric(difftime(total_end_time, total_start_time, units = "secs"))
cat("Total time spent on kNN imputation =", total_duration, "seconds\n")
return(data_copy)
}
Upvotes: 0