Reputation: 29
I've been working on this for a while and I can't really find any solution that accomplishes what I need...
Putting it simply, I have a DF with two columns: 'n' different rows (e.g. names of students) and 'm' different names of supervisors. 'n' can be larger or smaller than 'm'.
The problem:
Randomly assign the 'n' students to the 'm' supervisors' groups such that every supervisor receives the same number of students in his group (or as close as possible when 'n' is not a multiple of 'm').
Some students are already assigned to a particular supervisor beforehand. That is, some groups are empty at the beginning, and some already have rows assigned.
The limit of rows per group is the same for every group, namely ceiling(n/m).
Rows 'n' that are already assigned into one group CAN'T be switched to a new group.
So far I've been trying to solve the problem with dplyr, playing with different tables, assigning indices to each observation... but I feel my code is way too complicated for this type of problem, so I'm wondering if someone knows a simpler solution.
I'll leave a sample of my data frame for visual purposes. Of course, I'm dealing with a much bigger data set with different types of info, but the problem is exactly the same:
I have:
Names_stud (n)   Supervisors (m)
Ralph            SKINNER
Michael          NA
Mitch            NA
Julen            NA
Richard          CARAPAPEL
John             NA
Ramon            SKINNER
Laura            McGONAGALL
Paul             NA
Ivy              NA
Lucas            NA
Mathiew          NA
What I would like to have:
Names_students   Supervisor
Ralph            SKINNER
Michael          CARAPAPEL
Mitch            SKINNER
Julen            McGONAGALL
Richard          CARAPAPEL
John             CARAPAPEL
Ramon            SKINNER
Laura            McGONAGALL
Paul             McGONAGALL
Ivy              SKINNER
Lucas            McGONAGALL
Mathiew          CARAPAPEL
Such that:
table(DF$Supervisors)

 CARAPAPEL McGONAGALL    SKINNER
         4          4          4
In case 'n' is not a multiple of 'm', it is perfectly OK to have the closest possible result to this one (e.g. 4, 3, 3 or 4, 4, 3...).
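In other words, each group should end up with either floor(n/m) or ceiling(n/m) rows, with the remainder spread out. A quick base-R sketch of the target sizes (the numbers here are just an example):

```r
n <- 13                               # e.g. 13 students
m <- 3                                # e.g. 3 supervisors

sizes <- rep(floor(n / m), m)         # everyone gets at least the floor...
extra <- seq_len(n %% m)              # ...and the remainder is spread out
sizes[extra] <- sizes[extra] + 1

sizes                                 # 5 4 4
```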
So far I've done a lot of coding with dplyr, assigning indices to students previously assigned... but I always get stuck somewhere, and I feel the way I deal with it is very inefficient.
I am wondering if someone knows a specific solution for this. I've also checked the 'split' package, but couldn't find anything useful for this purpose there.
Thank you very much in advance. If you need any further clarification, please just ask.
PS: I couldn't find any related question on this specific problem. If there's one with a proper answer, please let me know.
Again, thanks in advance.
Upvotes: 2
Views: 407
Reputation: 354
Here's another approach which I think will be a bit more efficient than MR's, and hopefully just as easy to understand.
This is an assignment problem (bipartite matching), but the constraints are sufficiently simple that you can get away without using heavier algorithms or dedicated integer programming tools.
The strategy here is to generate the "right hand side" of the assignment for those students that don't already have a supervisor, and then row bind these new assignments with the existing ones.
To do this, we create a cyclic vector of supervisors that is sufficiently long, and then drop supervisors starting from the top of this vector that already have students to ensure the final groups are balanced.
set.seed(1)

n <- 10
m <- 3

# Initialise our students and supervisors
students <- sample(letters, n, replace = FALSE)
supers <- sample(LETTERS, m, replace = FALSE)

# Create your dataframe and randomly assign a few supers
df <- data.frame(student = students,
                 super = NA,
                 stringsAsFactors = FALSE)
df[sample(1:n, 2), "super"] <- sample(supers, 2)

# Each supervisor must be assigned to [floor(n / m), ceiling(n / m)] students
# We can ensure this by cycling through supervisors...
to_assign <- rep(supers, ceiling(n / m))

# ... but only if we skip those that have already been assigned to a student
for (super in df[!is.na(df$super), "super"]) {
  to_assign <- to_assign[-match(super, to_assign)]
}

new_assignments <- df[is.na(df$super), ]
new_assignments$super <- to_assign[1:nrow(new_assignments)]

result <- rbind(df[!is.na(df$super), ], new_assignments)
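To see the cyclic-vector idea in isolation, here is a tiny hand-made example (hypothetical names, no randomness, so the result is fully predictable):

```r
# 5 students, 2 supervisors, one student pre-assigned to "A"
pre    <- c("A", NA, NA, NA, NA)
supers <- c("A", "B")

# Cyclic pool, long enough for everyone: A B A B A B
pool <- rep(supers, ceiling(length(pre) / length(supers)))

# Drop one copy of each already-used supervisor from the front of the pool
for (s in pre[!is.na(pre)]) pool <- pool[-match(s, pool)]

# Fill the gaps from the front of the remaining pool
pre[is.na(pre)] <- pool[seq_len(sum(is.na(pre)))]

pre  # "A" "B" "A" "B" "A"  ->  A gets 3 students, B gets 2
```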
I don't think loops should be avoided just for the sake of avoiding loops; in this case the loop keeps the code simple, though you could probably do better with some cleverer use of data structures.
Upvotes: 0
Reputation: 317
I think this may work for what you want to do. myFun just makes a vector of random 'students', and sample generates an unevenly weighted vector of 'advisors' with ~70% NA values. Then the for loop fills in the NAs, one at a time, with whichever advisor currently has the fewest students according to table.
If anyone can do this in a more R-appropriate way without a for loop I'd be really interested to see it.
myFun <- function(n = 5000) {
  a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), simplify = FALSE))
  paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}

students <- myFun(50)
advisors <- sample(c("TA1", "TA2", "TA3", NA), size = 50, replace = TRUE,
                   prob = c(0.1, 0.2, 0.1, 0.7))

# stringsAsFactors = FALSE so we can write strings back into the column
datFrame <- data.frame(students, advisors, stringsAsFactors = FALSE)

# Count over fixed levels so an advisor with zero students still shows up
for (i in seq_along(datFrame$advisors)) {
  if (is.na(datFrame$advisors[i])) {
    counts <- table(factor(datFrame$advisors, levels = c("TA1", "TA2", "TA3")))
    datFrame$advisors[i] <- names(counts)[which.min(counts)]
  }
}
table(datFrame$advisors)
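For what it's worth, here is one loop-free sketch of the same idea (on small hypothetical data): count how many more students each advisor still needs, build a pool with exactly that many slots per advisor, and scatter it over the NA positions in one `sample` call.

```r
set.seed(1)

advisors <- c("TA1", NA, "TA2", NA, NA, "TA1", NA, NA, NA, NA)
levs <- c("TA1", "TA2", "TA3")
n <- length(advisors)

# Current counts over fixed levels, so zero-student advisors are included
have <- as.vector(table(factor(advisors, levels = levs)))

# Balanced targets: floor(n / m) each, remainder spread over the first few
target <- rep(floor(n / length(levs)), length(levs))
extra <- seq_len(n %% length(levs))
target[extra] <- target[extra] + 1

# Pool of open slots per advisor, shuffled into the NA positions
pool <- rep(levs, times = pmax(target - have, 0))
advisors[is.na(advisors)] <- sample(pool, sum(is.na(advisors)))

table(advisors)  # balanced: counts of 4, 3, 3 in some order
```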
Upvotes: 0