Reputation: 40231
I would like to group data in a data.frame by two columns and then sum a specific third column. For example:
> aggregate(mpg~gear+cyl, data=mtcars, FUN=sum)
gear cyl mpg
1 3 4 21.5
2 4 4 215.4
3 5 4 56.4
4 3 6 39.5
5 4 6 79.0
6 5 6 19.7
7 3 8 180.6
8 5 8 30.8
Now, I need to do this several times for different columns. So I would like to write a function which generalizes this. It take the data.frame and one of the columns (to keep things simple) and does the same thing.
agg.data <- function(df, colname) {
aggregate(mpg~gear+colname, data=df, FUN=sum)
}
Running this will produce:
Error in eval(expr, envir, enclos) : object 'colname' not found
How can I pass in the value of colname to aggregate?
Upvotes: 4
Views: 3723
Reputation: 1
You should probably find a function for that on tidyverse, however, there goes a function I often use for summary aggregation. It embbeds much of what was discussed above with formula()
but in a more general form:
get.stat = function(df,var.nm,agg.id){
#--- df data.frame for aggregation
#--- var.nm target variable to be aggregated
#--- agg.id index name for aggretation (single value or vector)
#--- generate aggregation formula
agg.formula = paste(agg.id, collapse = "+")
#--- df with summary results
df.res = data.frame(mean = aggregate(formula(paste0(var.nm,"~",agg.formula)), df, mean )[,length(agg.id)+1],
median = aggregate(formula(paste0(var.nm,"~",agg.formula)), df, median)[,length(agg.id)+1],
sd = aggregate(formula(paste0(var.nm,"~",agg.formula)), df, sd )[,length(agg.id)+1],
min = aggregate(formula(paste0(var.nm,"~",agg.formula)), df, min )[,length(agg.id)+1],
max = aggregate(formula(paste0(var.nm,"~",agg.formula)), df, max )[,length(agg.id)+1],
sum = aggregate(formula(paste0(var.nm,"~",agg.formula)), df, sum )[,length(agg.id)+1],
count = aggregate(formula(paste0(var.nm,"~",agg.formula)), df, length)[,length(agg.id)+1])
#--- bind indexers
for(c in 1:length(agg.id)){
df.res = cbind(df.res, aggregate(formula(paste0(var.nm,"~",agg.formula)), df, mean)[,c])
colnames(df.res)[length(colnames(df.res))] = agg.id[c]
}
#--- re-order col
df.res = df.res[,c(agg.id,colnames(df.res)[1:(length(colnames(df.res)) - c)])]
return(df.res)
}
With that function loaded, you can simply:
get.stat(df, "mpg",c("gear","cyl"))
Upvotes: 0
Reputation: 70246
You can easily use the "normal" aggregate
interface (i.e. not the formula interface) to supply column names in variables. The syntax is slightly different but still easy enough and doesn't require pasting:
agg.data2 <- function(df, colname) {
aggregate(df[["mpg"]], list(df[["gear"]], df[[colname]]), FUN=sum)
}
agg.data2(mtcars, "cyl")
# Group.1 Group.2 x
#1 3 4 21.5
#2 4 4 215.4
#3 5 4 56.4
#4 3 6 39.5
#5 4 6 79.0
#6 5 6 19.7
#7 3 8 180.6
#8 5 8 30.8
Here's the dplyr equivalent:
library(dplyr)
agg.data.dplyr <- function(df, colname) {
df %>%
group_by_(.dots = c("gear", colname)) %>%
summarise(sum = sum(mpg)) %>%
ungroup()
}
agg.data.dplyr(mtcars, "cyl")
Upvotes: 1
Reputation: 118779
Using data.table
:
fun.dt <- function(dt, col) {
dt[, .(mpg=sum(mpg)), by=c("gear", col)]
}
require(data.table)
dt = as.data.table(mtcars)
fun.dt(dt, "cyl")
# gear cyl mpg
# 1: 4 6 79.0
# 2: 4 4 215.4
# 3: 3 6 39.5
# 4: 3 8 180.6
# 5: 3 4 21.5
# 6: 5 4 56.4
# 7: 5 8 30.8
# 8: 5 6 19.7
The by
expression in data.tables can also take a character vector of column names in addition to lists of columns/expressions. We can simply provide a character vector to the by
argument.
Upvotes: 3
Reputation: 6659
Paste together a string representation of your formula, and give that string as an argument to formula()...
agg.data <- function(df, colname) {
aggregate(formula(paste0("mpg~gear+", colname)), data=df, FUN=sum)
}
> agg.data(mtcars, "cyl")
gear cyl mpg
1 3 4 21.5
2 4 4 215.4
3 5 4 56.4
4 3 6 39.5
5 4 6 79.0
6 5 6 19.7
7 3 8 180.6
8 5 8 30.8
Upvotes: 10
Reputation: 92282
You can also pass an unquoted column name using deparse
and substitute
agg.data <- function(df, colname) {
aggregate(df$mpg, list(df$gear, df[, deparse(substitute(colname))]), FUN=sum)
}
agg.data(mtcars, cyl)
# Group.1 Group.2 x
# 1 3 4 21.5
# 2 4 4 215.4
# 3 5 4 56.4
# 4 3 6 39.5
# 5 4 6 79.0
# 6 5 6 19.7
# 7 3 8 180.6
# 8 5 8 30.8
Upvotes: 1
Reputation: 2663
You can also do this in the style of ggplot
or with
that allows you to just write the colnames as they are without passing a string by using substitute
.
agg.data3 = function (df, colname){
colname = substitute(colname)
colname = as.character(colname)
aggregate(formula(paste0("mpg~gear+", colname)), data=mtcars, FUN=sum)
}
usage
agg.data3(cars, cyl)
Upvotes: 0