jacopoburelli
jacopoburelli

Reputation: 33

Custom k-Nearest Neighbor (kNN) slow implementation

I'm writing for an advice about a kNN implementation I made.

I'm studying a dataset representing a multiple choice question exam and my variable are an identifier of the student, the course, the year in which they took the course, a string identifier for each question of each exam, whether they answered or not plus the score on the correspondent question.

Since some student didn't answer some question, I want to impute using kNN with k=5 what they would have answered. The rationale I used is the following: for each year and course I filter my dataset to have students which did leave at least one answer (all of them for that course and year essentially). Starting with a student, I look at all their unanswered questions. For each of them I select all the other students that did answer that question instead and I compute the distance between the two students (before passing to the successive student), sorting the distances and taking the mean of the score of the other five. The distance I introduced was one highlighting how far the colleagues are in terms of answers: i.e the distances is 0 if they have the same score on the same answer, $1$ otherwise. In this way when I look at the sum of this 0 and 1 on all the questions and take the smallest 5, should represent(most likely) students similar to me in terms of behavior.

Feel free to comment or suggest a better way to tackle the problem. The dataset has ~400000 rows (i.e questions, with repetition for students) representing 5 years for 3 courses and ~6000 unique students and 6 columns, with a rate of abstentions of 20%. Is this considered a large dataset? The problem is that the following implementation takes more than 12h to finish and I wonder what could be improved to reduce the time to minutes if possible, or at least few hours. I think that the slow part is computing each time the distance between pair of students, but even if I use a list to save some score instead of computing it twice, I don't how much time I could spare. I tried to precompute the distances using a matrix, but that doesn't seems to work either. All the data is in a data.table format, which I red to be better in terms of speed.

Any help or improvement would be appreciated, I saw there is also knn function in R but didn't know whether I could pass my distance function as an argument.

compute_distance <- function(student_a, student_b, data_course, current_question) {
  # Get scores for all other questions except the current one
  scores_a <- data_course[student_id == student_a & question_id != current_question, SCORE]
  scores_b <- data_course[student_id == student_b & question_id != current_question, SCORE]
  
  # Ensure both score sets have the same questions
  if (length(scores_a) == length(scores_b)) {
    # Binary distance (0 if same score, 1 if different)
    dist <- sum(scores_a != scores_b)
    return(dist)
  } else {
    return(Inf)  # If questions don't match, return a large distance
  }
}
knn_impute <- function(data, k = 5) {
  
  # Start the entire process timer
  total_start_time <- Sys.time()
  
  # Iterate over each unique year
  for (year_x in unique(data$year)) {
    # Track the time for each year
    year_start_time <- Sys.time()
    
    # Print current year
    cat("Processing year:", year_x, "\n")
    
    # Filter data for this year
    data_year <- data[year == year_x]
    
    # Iterate over each unique course within the year
    for (course_x in unique(data_year$course)) {
      # Track the time for each course
      course_start_time <- Sys.time()
      
      # Print current course
      cat("  Processing course:", course_x, "\n")
      
      # Filter data for this course within the year
      data_course <- data_year[course == course_x]
      
      # Get unique unanswered students for this year and course
      unanswered_students <- unique(data_course[ans == 0, student_id])
      
      # Iterate over each student who has unanswered questions
      for (student_x in unanswered_students) {
        # Get all unanswered questions for student_x
        unanswered_questions <- data_course[student_id == student_x & ans == 0, question_id]
        
        # Iterate over each unanswered question for student_x
        for (question_x in unanswered_questions) {
          
          # Get all students who answered the current question
          answered_students <- data_course[question_id == question_x & ans != 0, student_id]
          
          # If no one answered the question, skip it
          if (length(answered_students) == 0) next
          
          # Calculate distances and store directly in a vector for sorting
          distances <- sapply(answered_students, function(student_y) {
            compute_distance(student_x, student_y, data_course, question_x)
          })
          
          # Sort distances and select the k nearest neighbors
          nearest_neighbors <- answered_students[order(distances)[1:min(k, length(distances))]]
          
          # Impute the score for student_x's unanswered question
          imputed_score <- mean(data_course[student_id %in% nearest_neighbors & question_id == question_x, score], na.rm = TRUE)
          
          # Update the score for student_x in the data
          data[student_id == student_x & question_id == question_x, score := imputed_score]
        }
      }
      
      # Calculate and print course duration
      course_end_time <- Sys.time()
      course_duration <- as.numeric(difftime(course_end_time, course_start_time, units = "secs"))
      cat("  Time spent on course:", course_x, "=", course_duration, "seconds\n")
    }
    
    # Calculate and print year duration
    year_end_time <- Sys.time()
    year_duration <- as.numeric(difftime(year_end_time, year_start_time, units = "secs"))
    cat("Time spent on year:", year_x, "=", year_duration, "seconds\n\n")
  }
  
  # Print total time spent
  total_end_time <- Sys.time()
  total_duration <- as.numeric(difftime(total_end_time, total_start_time, units = "secs"))
  cat("Total time spent on kNN imputation =", total_duration, "seconds\n")
  
  return(data)
}

# Call your function and run the imputation
imputed_data <- knn_impute(filtered_data, k = 5)

example:

student_id question_id course year ans score
1 10 Course_A 2019 1 3
2 15 Course_B 2020 0 0
3 20 Course_A 2021 1 5
4 25 Course_C 2019 1 4
5 30 Course_A 2020 0 0

The full CSV file with 10000 rows created with chat gpt working as example can be downloaded from this link.

Upvotes: 1

Views: 257

Answers (2)

ThomasIsCoding
ThomasIsCoding

Reputation: 102529

I tried my best to understand your logic and here is a refactored version, hope it could be faster

f <- function(d, k = 5) {
    lst <- split(d, ~ans)
    unanswered <- lst$`0`
    answered <- lst$`1`
    student_x <- unique(unanswered$student_id)
    question_x <- unique(unanswered$question_id)
    student_y <- with(answered, unique(student_id[question_id %in% question_x]))
    for (q in question_x) {
        u <- subset(d, question_id != q)
        for (sa in student_x) {
            score_a <- subset(u, student_id == sa)$score
            distances <- vector(mode = "numeric", length = length(student_y))
            for (sb_idx in seq_along(student_y)) {
                score_b <- subset(u, student_id == student_y[sb_idx])$score
                if (!length(score_b)) next
                distances[sb_idx] <- ifelse(length(score_a) != length(score_b), Inf, sum(score_a != score_b))
            }
            nearest_neighbors <- student_y[order(distances)[1:min(k, length(distances))]]
            imputed_score <- mean(subset(u, student_id %in% nearest_neighbors & question_id == q)$score, na.rm = TRUE)
            d[with(d, student_id == sa & question_id == q), "score"] <- imputed_score
        }
    }
    d
}

out <- do.call(
    rbind,
    lapply(
        unname(split(df, ~ year + course)),
        f
    )
)
out <- out[order(as.integer(row.names(out))), ]

where the big dataset is split into smaller chunks by year + course combos, and then apply the "KNN" to each subgroup. The gain is that, we don't need to index from the big dataframe, but just from the related subset, and thus it might be a bit faster.

Upvotes: 0

A. S. K.
A. S. K.

Reputation: 2816

I tried one approach, but with the test dataset provided, it actually turned out to be slower than your original solution. I suspect that the speed of the algorithm depends heavily on the distribution of missingness (for example, are non-answers clustered among specific questions/students, or are they randomly distributed?). Even so, I'm not optimistic that this version will be any faster than yours. But I'm posting it here just in case.

The important change here is that instead of computing distance between students who answered exactly the same set of questions, we instead compute distance between all pairs of students. If one student answered a question and the other didn't, we score that question as "different" for those students. If both students didn't answer a particular question, we score that question as "same" for those students. This isn't quite the same as your original metric, but it seems to me that it's similar in spirit. It also has two advantages:

  • We can pre-calculate similarity between students once, instead of re-calculating similarity for every individual question a student skipped.
  • We have a larger pool of available neighbors for each student. If two students answered all questions identically except that one student skipped question 4 and the other student skipped question 7, it seems to me that those students are quite similar and it would be useful to compare them.

The function below implements this change. I've eliminated the compute_distance() function and just folded the necessary code into the main function.

For efficiency, this function doesn't actually calculate distances between all pairs of students. Instead, for each student who skipped at least one question, it calculates the distance between that student and all other students who answered that question. The distance score will be affected by the fact that student x didn't answer the question and every student y did; however, since the distance score for each student y will be affected in the same way, this should not be a problem.

I made two other changes to the code:

  • In data.tables, the := operator assigns values by reference; this means that the function actually changes the original data argument. I've changed it so that it updates a copy of data instead so as to leave the original table unchanged for comparison.
  • The original update operation affected all records for that student and question id. However, this statement incorrectly targeted rows for other years and courses; I changed it so that it affects all records for the student, question id, year, and course.
knn_impute <- function(data, k = 5) {

  # Start the entire process timer
  total_start_time <- Sys.time()

  # Copy the data so we don't overwrite the original table
  data_copy = copy(data)

  # Iterate over each unique year
  for (year_x in unique(data$year)) {
    # Track the time for each year
    year_start_time <- Sys.time()

    # Print current year
    cat("Processing year:", year_x, "\n")

    # Filter data for this year
    data_year <- data[year == year_x]

    # Iterate over each unique course within the year
    for (course_x in unique(data_year$course)) {
      # Track the time for each course
      course_start_time <- Sys.time()

      # Print current course
      cat("  Processing course:", course_x, "\n")

      # Filter data for this course within the year
      data_course <- data_year[course == course_x]

      # Get unique unanswered students for this year and course
      unanswered_students <- unique(data_course[ans == 0, student_id])

      # Iterate over each student who has unanswered questions
      for (student_x in unanswered_students) {

        # Get all unanswered questions for student_x
        unanswered_questions <- data_course[student_id == student_x & ans == 0, question_id]

        # Get all other students who answered at least one question that the
        # current student didn't answer
        other_students <- sort(unique(data_course[student_id != student_x & question_id %in% unanswered_questions & ans != 0, student_id]))

        # For each other student who answered at least one question that the
        # current student didn't answer, get the distance between that student
        # and the current student
        scores_a <- data_course[student_id == student_x, score]
        scores_a <- ifelse(data_course[student_id == student_x, ans] == 0, -1, scores_a)
        distances <- sapply(other_students, function(student_y) {
          # Get the vector of answers for each student
          scores_b <- data_course[student_id == student_y, score]
          # Use -1 for unanswered questions
          scores_b <- ifelse(data_course[student_id == student_y, ans] == 0, -1, scores_b)
          # Return the number of mismatches between the two students
          return(sum(scores_a != scores_b))
        })

        # Iterate over each unanswered question for student_x
        for (question_x in unanswered_questions) {

          # Get all students who answered the current question
          answered_students <- sort(unique(data_course[question_id == question_x & ans != 0, student_id]))

          # If no one answered the question, skip it
          if (length(answered_students) == 0) next

          # Get distances for students who answered the question
          q_distances <- distances[other_students %in% answered_students]

          # Sort distances and select the k nearest neighbors
          nearest_neighbors <- answered_students[order(q_distances)[1:min(k, length(q_distances))]]

          # Impute the score for student_x's unanswered question
          imputed_score <- mean(data_course[student_id %in% nearest_neighbors & question_id == question_x, score], na.rm = TRUE)

          # Update the score for student_x in the data
          data_copy[year == year_x & course == course_x & student_id == student_x & question_id == question_x, score := imputed_score]
        }
      }

      # Calculate and print course duration
      course_end_time <- Sys.time()
      course_duration <- as.numeric(difftime(course_end_time, course_start_time, units = "secs"))
      cat("  Time spent on course:", course_x, "=", course_duration, "seconds\n")
    }

    # Calculate and print year duration
    year_end_time <- Sys.time()
    year_duration <- as.numeric(difftime(year_end_time, year_start_time, units = "secs"))
    cat("Time spent on year:", year_x, "=", year_duration, "seconds\n\n")
  }

  # Print total time spent
  total_end_time <- Sys.time()
  total_duration <- as.numeric(difftime(total_end_time, total_start_time, units = "secs"))
  cat("Total time spent on kNN imputation =", total_duration, "seconds\n")

  return(data_copy)
}

Upvotes: 0

Related Questions