DRZP
DRZP

Reputation: 29

Assign/Split rows into equal size groups up to a certain threshold per group

I've been working on this for a while and I can't really find any solution that accomplishes what I need...

Putting it simple, I have a DF with two columns, let's say, 'n' different rows (e.g. names of students), and 'm' different names of supervisors. 'n' can be larger or smaller than 'm'.

The problem:

  1. Randomly assign the 'n' students into supervisor's groups 'm' such that every supervisor receives the same number of students in his group (or as close as possible if 'n' and 'm' are not multiples).

  2. Some students are already assigned to a particular supervisor before hand. This is, some groups are empty at the beginning, and some already have some rows assigned.

  3. The limit of rows per group is the same for every group, this is, round(n/m).

  4. Rows 'n' that are already assigned into one group CAN'T be switched to a new group.

So far I've been trying to sort the problem with dplyr, playing with different tables, assigning indices for each observation... but I feel my code is way too complicated for this type of problem, so I'm wondering if someone knows a simpler solution.

I'll leave a sample of my data frame for visual purposes. Of course I'm dealing with a much bigger data set with different type of info. but the problem is exactly the same:

I have:

Names_stud (n)     Supervisors (m)
Ralph              SKINNER
Michael            NA
Mitch              NA
Julen              NA
Richard            CARAPAPEL        
John               NA
Ramon              SKINNER
Laura              McGONAGALL
Paul               NA
Ivy                NA
Lucas              NA
Mathiew            NA

What I would like to have:

Names_students     Supervisor
Ralph              SKINNER
Michael            CARAPAPEL
Mitch              SKINNER
Julen              McGONAGALL
Richard            CARAPAPEL        
John               CARAPAPEL
Ramon              SKINNER
Laura              McGONAGALL
Paul               McGONAGALL
Ivy                SKINNER
Lucas              McGONAGALL
Mathiew            CARAPAPEL

Such that:

table(DF$Supervisors)

McGONAGALL     SKINNER     CARAPAPEL
         4           4             4

In case 'n' is not a multiple of 'm' is perfectly OK to have the closest result the this one (e.g. 4, 3, 3, or 4, 4, 3...).

So far I've done a lot of coding with dplyr, assigning indices to students previously assigned... but I always get stuck somewhere and I feel the way I deal with it is supper inefficient.

I am wondering if someone knows a specific solution to deal with this. I've also checked the 'split' package. Could't find anything useful for this purpose there.

Thank you very much in advance. If you need any further clarification, please just ask.

PD: I couldn't find any related question to this specific problem. If there's one with a proper answer please let me know.

Again, thank in advance.

Upvotes: 2

Views: 407

Answers (2)

Callum Webb
Callum Webb

Reputation: 354

Here's another approach which I think will be a bit more efficient than MR's, and hopefully just as easy to understand.

This is an assignment problem (bipartite matching), but the constraints are sufficiently simple that you can get away without using heavier algorithms or dedicated integer programming tools.

The strategy here is to generate the "right hand side" of the assignment for those students that don't already have a supervisor, and then row bind these new assignments with the existing ones.

To do this, we create a cyclic vector of supervisors that is sufficiently long, and then drop supervisors starting from the top of this vector that already have students to ensure the final groups are balanced.

set.seed(1)

n <- 10
m <- 3

# Initialise our students and supervisors
students <- sample(letters, n, replace = FALSE)
supers <- sample(LETTERS, m, replace = FALSE)

# Create your dataframe and randomly assign a few supers
df <- data.frame(student = students,
                 super = NA, stringsAsFactors = FALSE)
df[sample(1:n, 2), "super"] <- sample(supers, 2)

# Each supervisor must be assigned to [floor(n / m), ceiling(n / m)] students
# We can ensure this by cycling through supervisors... 
to_assign <- rep(supers, ceiling(n / m))

# ... but only if we skip those that have already been assigned to a student
for (super in df[!is.na(df$super), "super"]) {
  to_assign <- to_assign[-match(super, to_assign)]
}

new_assignments <- df[is.na(df$super), ]
new_assignments$super <- to_assign[1:nrow(new_assignments)]

result <- rbind(df[!is.na(df$super), ], new_assignments)

I don't think loops should be avoided for the sake of avoiding loops, in this case I think it's fine and produces simple code, but you could probably do better with some more clever use of data structures.

Upvotes: 0

mikeHoncho
mikeHoncho

Reputation: 317

I think this may work for what you want to do. myFun just makes a list of random 'students' and sample generates an unevenly weighted list of 'advisors' with ~70% NA values. Then the for loop fills in the NA's with the advisor that has the lowest value on the table call.

If anyone can do this in a more R-appropriate way without a for loop I'd be really interested to see it.

myFun <- function(n = 5000) {
  a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
  paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}

students <- myFun(50)
advisors <- sample(c("TA1", "TA2", "TA3", NA), size = 50, replace = TRUE, prob = c(0.1, 0.2, 0.1, 0.7))
datFrame <- data.frame(students, advisors)

for(i in 1:length(datFrame$advisors)){
  ifelse(is.na(datFrame$advisors[i]),
         datFrame$advisors[i] <- names(table(datFrame$advisors))[which.min(table(datFrame$advisors))],
         datFrame$advisors[i] <- datFrame$advisors[i])
}

table(datFrame$advisors)

Upvotes: 0

Related Questions