oneself
oneself

Reputation: 40231

Aggregate by string column name in R

I would like to group data in a data.frame by two columns and then sum a specific third column. For example:

> aggregate(mpg~gear+cyl, data=mtcars, FUN=sum)
  gear cyl   mpg
1    3   4  21.5
2    4   4 215.4
3    5   4  56.4
4    3   6  39.5
5    4   6  79.0
6    5   6  19.7
7    3   8 180.6
8    5   8  30.8

Now, I need to do this several times for different columns. So I would like to write a function which generalizes this. It take the data.frame and one of the columns (to keep things simple) and does the same thing.

agg.data <- function(df, colname) {
  aggregate(mpg~gear+colname, data=df, FUN=sum) 
}

Running this will produce:

Error in eval(expr, envir, enclos) : object 'colname' not found

How can I pass in the value of colname to aggregate?

Upvotes: 4

Views: 3723

Answers (6)

Murilo Vianna
Murilo Vianna

Reputation: 1

You should probably find a function for that on tidyverse, however, there goes a function I often use for summary aggregation. It embbeds much of what was discussed above with formula() but in a more general form:

get.stat = function(df,var.nm,agg.id){

  #--- df       data.frame for aggregation
  #--- var.nm   target variable to be aggregated
  #--- agg.id   index name for aggretation (single value or vector)

  #--- generate aggregation formula
  agg.formula = paste(agg.id, collapse = "+")

  #--- df with summary results
  df.res = data.frame(mean    = aggregate(formula(paste0(var.nm,"~",agg.formula)), df, mean  )[,length(agg.id)+1],
                      median  = aggregate(formula(paste0(var.nm,"~",agg.formula)), df, median)[,length(agg.id)+1],
                      sd      = aggregate(formula(paste0(var.nm,"~",agg.formula)), df, sd    )[,length(agg.id)+1],
                      min     = aggregate(formula(paste0(var.nm,"~",agg.formula)), df, min   )[,length(agg.id)+1],
                      max     = aggregate(formula(paste0(var.nm,"~",agg.formula)), df, max   )[,length(agg.id)+1],
                      sum     = aggregate(formula(paste0(var.nm,"~",agg.formula)), df, sum   )[,length(agg.id)+1],
                      count   = aggregate(formula(paste0(var.nm,"~",agg.formula)), df, length)[,length(agg.id)+1])

  #--- bind indexers
  for(c in 1:length(agg.id)){
    df.res = cbind(df.res, aggregate(formula(paste0(var.nm,"~",agg.formula)), df, mean)[,c])
    colnames(df.res)[length(colnames(df.res))] = agg.id[c]
  }

  #--- re-order col
  df.res = df.res[,c(agg.id,colnames(df.res)[1:(length(colnames(df.res)) - c)])]

  return(df.res)
}

With that function loaded, you can simply:

get.stat(df, "mpg",c("gear","cyl"))

Upvotes: 0

talat
talat

Reputation: 70246

You can easily use the "normal" aggregate interface (i.e. not the formula interface) to supply column names in variables. The syntax is slightly different but still easy enough and doesn't require pasting:

agg.data2 <- function(df, colname) {
  aggregate(df[["mpg"]], list(df[["gear"]], df[[colname]]), FUN=sum) 
}
agg.data2(mtcars, "cyl")
#  Group.1 Group.2     x
#1       3       4  21.5
#2       4       4 215.4
#3       5       4  56.4
#4       3       6  39.5
#5       4       6  79.0
#6       5       6  19.7
#7       3       8 180.6
#8       5       8  30.8

Here's the dplyr equivalent:

library(dplyr)
agg.data.dplyr <- function(df, colname) {
  df %>%
    group_by_(.dots = c("gear", colname)) %>%
    summarise(sum = sum(mpg)) %>%
    ungroup()
}
agg.data.dplyr(mtcars, "cyl")

Upvotes: 1

Arun
Arun

Reputation: 118779

Using data.table:

fun.dt <- function(dt, col) {
    dt[, .(mpg=sum(mpg)), by=c("gear", col)]
}

require(data.table)
dt = as.data.table(mtcars)
fun.dt(dt, "cyl")
#    gear cyl   mpg
# 1:    4   6  79.0
# 2:    4   4 215.4
# 3:    3   6  39.5
# 4:    3   8 180.6
# 5:    3   4  21.5
# 6:    5   4  56.4
# 7:    5   8  30.8
# 8:    5   6  19.7

The by expression in data.tables can also take a character vector of column names in addition to lists of columns/expressions. We can simply provide a character vector to the by argument.

Upvotes: 3

cory
cory

Reputation: 6659

Paste together a string representation of your formula, and give that string as an argument to formula()...

agg.data <- function(df, colname) {
  aggregate(formula(paste0("mpg~gear+", colname)), data=df, FUN=sum) 
}

> agg.data(mtcars, "cyl")
  gear cyl   mpg
1    3   4  21.5
2    4   4 215.4
3    5   4  56.4
4    3   6  39.5
5    4   6  79.0
6    5   6  19.7
7    3   8 180.6
8    5   8  30.8

Upvotes: 10

David Arenburg
David Arenburg

Reputation: 92282

You can also pass an unquoted column name using deparse and substitute

agg.data <- function(df, colname) {
  aggregate(df$mpg, list(df$gear, df[, deparse(substitute(colname))]), FUN=sum) 
}

agg.data(mtcars, cyl)
#   Group.1 Group.2     x
# 1       3       4  21.5
# 2       4       4 215.4
# 3       5       4  56.4
# 4       3       6  39.5
# 5       4       6  79.0
# 6       5       6  19.7
# 7       3       8 180.6
# 8       5       8  30.8

Upvotes: 1

OganM
OganM

Reputation: 2663

You can also do this in the style of ggplot or with that allows you to just write the colnames as they are without passing a string by using substitute.

agg.data3 = function (df, colname){
    colname = substitute(colname)
    colname = as.character(colname)
    aggregate(formula(paste0("mpg~gear+", colname)), data=mtcars, FUN=sum)
}

usage

agg.data3(cars, cyl)

Upvotes: 0

Related Questions