Kevin Burnham
Kevin Burnham

Reputation: 563

How do I pass mulitple columns to a function within dplyr::summarize

I am trying to pass all columns from a data.frame matching a criteria to a function within the summarize function of dplyr as follows:

df %>% group_by(Version, Type) %>%
  summarize(mcll(TrueClass, starts_with("pred")))

Error: argument is of length zero

Is there a way to do this? A working example follows:

Build a simulated data.frame of sample predictions. These are interpreted as the output of a classification algorithm.

library(dplyr)
nrow <- 40
ncol <- 4
set.seed(567879)

getProbs <- function(i) {
  p <- runif(i)
  return(p / sum(p))
}
df <- data.frame(matrix(NA, nrow, ncol))
for (i in seq(nrow)) df[i, ] <- getProbs(ncol)
names(df) <- paste0("pred.", seq(ncol))

add a column indicating the true class

df$TrueClass <- factor(ceiling(runif(nrow, min = 0, max = ncol)))

add categorical columns for sub-setting

df$Type <- c(rep("a", nrow / 2), rep("b", nrow / 2))
df$Version <-  rep(1:4, times = nrow / 4)

now I want to calculate the Multiclass LogLoss for these predictions using the function below:

mcll <- function (act, pred) 
{
  if (class(act) != "factor") {
    stop("act must be a factor")
  }
  pred[pred == 0] <- 1e-15
  pred[pred == 1] <- 1 - 1e-15
  dummies <- model.matrix(~act - 1)
  if (nrow(dummies) != nrow(pred)) {
    return(0)
  }
  return(-1 * (sum(dummies * log(pred)))/length(act))
}

this is easily done with the entire data set

act <- df$TrueClass
pred <- df %>% select(starts_with("pred"))
mcll(act, pred)

but I want to use dplyr group_by to calculate mcll for each subset of the data

df %>% group_by(Version, Type) %>%
  summarize(mcll(TrueClass, starts_with("pred")))

Ideally I could do this without changing the mcll() function, but I am open to doing that if it simplifies the other code.

Thanks!

EDIT: Note that the input to mcll is a vector of true values and a matrix of probabilities with one column for each "pred" column. For each subset of data, mcll should return a scalar. I can get exactly what I want with the code below, but I was hoping for something within the context of dplyr.

mcll_df <- data.frame(matrix(ncol = 3, nrow = 8))
names(mcll_df) <- c("Type", "Version", "mcll")
count = 1
for (ver in unique(df$Version)) {
  for (type in unique(df$Type)) {
    subdat <- df %>% filter(Type == type & Version == ver)
    val <- mcll(subdat$TrueClass, subdat %>% select(starts_with("pred")))
    mcll_df[count, ] <- c(Type = type, Version = ver, mcll = val)
    count = count + 1
  }
}
head(mcll_df)
  Type Version             mcll
1    a       1 1.42972507510096
2    b       1 1.97189000832723
3    a       2 1.97988830406062
4    b       2 1.21387875938737
5    a       3 1.30629638026735
6    b       3 1.48799237895462

Upvotes: 4

Views: 344

Answers (2)

eddi
eddi

Reputation: 49448

This is easy to do using data.table:

library(data.table)

setDT(df)[, mcll(TrueClass, .SD), by = .(Version, Type), .SDcols = grep("^pred", names(df))] 
#   Version Type       V1
#1:       1    a 1.429725
#2:       2    a 1.979888
#3:       3    a 1.306296
#4:       4    a 1.668330
#5:       1    b 1.971890
#6:       2    b 1.213879
#7:       3    b 1.487992
#8:       4    b 1.171286

Upvotes: 2

mfidino
mfidino

Reputation: 3055

I had to change the mcll function a little bit but then it worked. The problem is occurring with the second if statement. You are telling the function to get nrow(pred), but if you are summarizing over multiple columns you are actually only supplying a vector each time (because each column gets analyzed separately). Additionally, I switched the order of the arguments being entered into the function.

mcll <- function (pred, act) 
{
  if (class(act) != "factor") {
    stop("act must be a factor")
  }
   pred[pred == 0] <- 1e-15
   pred[pred == 1] <- 1 - 1e-15

  dummies <- model.matrix(~act - 1)
  if (nrow(dummies) != length(pred)) { # the main change is here
    return(0)
  }
  return(-1 * (sum(dummies * log(pred)))/length(act))
}

From there we can use the summarise_each function.

df %>% group_by(Version,Type) %>% summarise_each(funs(mcll(., TrueClass)), matches("pred"))

  Version  Type   pred.1   pred.2   pred.3   pred.4
    (int) (chr)    (dbl)    (dbl)    (dbl)    (dbl)
1       1     a 1.475232 1.972779 1.743491 1.161984
2       1     b 2.030829 1.331629 1.397577 1.484865
3       2     a 1.589256 1.740858 1.898906 2.005511

I checked this against a subset of the data and it looks like it works.

mcll(df$pred.1[which(df$Type=="a" & df$Version==1)],
 df$TrueClass[which(df$Type=="a" & df$Version==1)])

[1] 1.475232 #pred.1 mcll when Version equals 1 and Type equals a.

Upvotes: 0

Related Questions