How do I pass multiple columns to a function within dplyr::summarize

Question

I am trying to pass all columns of a data.frame that match a criterion to a function inside dplyr's summarize(), as follows:

df %>% group_by(Version, Type) %>%
  summarize(mcll(TrueClass, starts_with("pred")))

Error: argument is of length zero

Is there a way to do this? A working example follows:

Build a simulated data.frame of sample predictions. These are interpreted as the output of a classification algorithm.

library(dplyr)
nrow <- 40
ncol <- 4
set.seed(567879)

getProbs <- function(i) {
  p <- runif(i)
  return(p / sum(p))
}
df <- data.frame(matrix(NA, nrow, ncol))
for (i in seq(nrow)) df[i, ] <- getProbs(ncol)
names(df) <- paste0("pred.", seq(ncol))

Add a column indicating the true class:

df$TrueClass <- factor(ceiling(runif(nrow, min = 0, max = ncol)))

Add categorical columns for subsetting:

df$Type <- c(rep("a", nrow / 2), rep("b", nrow / 2))
df$Version <-  rep(1:4, times = nrow / 4)

Now I want to calculate the multiclass log loss for these predictions using the function below:

mcll <- function (act, pred) 
{
  if (class(act) != "factor") {
    stop("act must be a factor")
  }
  pred[pred == 0] <- 1e-15
  pred[pred == 1] <- 1 - 1e-15
  dummies <- model.matrix(~act - 1)
  if (nrow(dummies) != nrow(pred)) {
    return(0)
  }
  return(-1 * (sum(dummies * log(pred)))/length(act))
}

This is easily done with the entire data set:

act <- df$TrueClass
pred <- df %>% select(starts_with("pred"))
mcll(act, pred)

But I want to use dplyr's group_by() to calculate mcll for each subset of the data:

df %>% group_by(Version, Type) %>%
  summarize(mcll(TrueClass, starts_with("pred")))

Ideally I could do this without changing the mcll() function, but I am open to doing that if it simplifies the other code.

Thanks!

EDIT: Note that the input to mcll is a vector of true values and a matrix of probabilities with one column for each "pred" column. For each subset of data, mcll should return a scalar. I can get exactly what I want with the code below, but I was hoping for something within the context of dplyr.

mcll_df <- data.frame(matrix(ncol = 3, nrow = 8))
names(mcll_df) <- c("Type", "Version", "mcll")
count = 1
for (ver in unique(df$Version)) {
  for (type in unique(df$Type)) {
    subdat <- df %>% filter(Type == type & Version == ver)
    val <- mcll(subdat$TrueClass, subdat %>% select(starts_with("pred")))
    mcll_df[count, ] <- c(Type = type, Version = ver, mcll = val)
    count = count + 1
  }
}
head(mcll_df)
  Type Version             mcll
1    a       1 1.42972507510096
2    b       1 1.97189000832723
3    a       2 1.97988830406062
4    b       2 1.21387875938737
5    a       3 1.30629638026735
6    b       3 1.48799237895462

eddi · Accepted Answer

This is easy to do using data.table:

library(data.table)

setDT(df)[, mcll(TrueClass, .SD), by = .(Version, Type), .SDcols = grep("^pred", names(df))] 
#   Version Type       V1
#1:       1    a 1.429725
#2:       2    a 1.979888
#3:       3    a 1.306296
#4:       4    a 1.668330
#5:       1    b 1.971890
#6:       2    b 1.213879
#7:       3    b 1.487992
#8:       4    b 1.171286
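
For readers who want to stay within dplyr, as the question asks, here is a minimal sketch. It is not part of the answer above. It assumes dplyr 1.1.0 or later, where pick() can select columns by tidyselect helper inside summarize(). The as.matrix() call is an addition of mine so that mcll() receives a matrix, which is how the question's EDIT describes the input. The block rebuilds the question's data and mcll() so it can run on its own.

```r
library(dplyr)

# Rebuild the question's simulated data so this sketch is self-contained.
set.seed(567879)
nrow <- 40
ncol <- 4
getProbs <- function(i) {
  p <- runif(i)
  p / sum(p)
}
df <- data.frame(matrix(NA, nrow, ncol))
for (i in seq(nrow)) df[i, ] <- getProbs(ncol)
names(df) <- paste0("pred.", seq(ncol))
df$TrueClass <- factor(ceiling(runif(nrow, min = 0, max = ncol)))
df$Type <- c(rep("a", nrow / 2), rep("b", nrow / 2))
df$Version <- rep(1:4, times = nrow / 4)

# mcll() exactly as defined in the question.
mcll <- function(act, pred) {
  if (class(act) != "factor") {
    stop("act must be a factor")
  }
  pred[pred == 0] <- 1e-15
  pred[pred == 1] <- 1 - 1e-15
  dummies <- model.matrix(~act - 1)
  if (nrow(dummies) != nrow(pred)) {
    return(0)
  }
  return(-1 * (sum(dummies * log(pred))) / length(act))
}

# pick() returns the current group's "pred" columns as a data frame;
# as.matrix() (my assumption, see above) converts them before the call.
result <- df %>%
  group_by(Version, Type) %>%
  summarize(
    mcll = mcll(TrueClass, as.matrix(pick(starts_with("pred")))),
    .groups = "drop"
  )
result
```

With the same seed, the per-group values should agree with the data.table result above, though the rows come back sorted by Version and then Type.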
