beddotcom
beddotcom

Reputation: 507

Aggregate by averaging m-1 columns and summing remainder

I didn't see another question quite like this on SO. Let's say I have a dataframe of consisting of 1...m = M variables I want to aggregate over, plus two other grouping variables I plan to aggregate by (m+2 columns total). One of the variables in M needs to be averaged, and the other m-1 should be summed.

I'm familiar with various ways to aggregate data in R, including performing different functions on separate columns. For example, I like this data.table method:

df <- df[, list(w = sum(w), x = mean(x)), by = c("y", "z")]

which sums w and averages y by two groups, y,z.

After some research, it also looks like there's a dplyr function across(-x, sum) that can subject all variables meeting a condition to an aggregation function -- in this case, sum all variables other than x.

I'm having trouble figuring out how to use these together. In my case, I have > 1,000,000 rows and 200+ columns, so it's not feasible to cbind() everything on the LHS of an aggregate()-type function.

MWE: Consider the iris dataset.

library(datasets)
data(iris)
summary(iris)

which has four numeric columns (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) and one factor (Species).

I want to group by species, and within that group, aggregate each of the other four columns. However, Sepal.Length should be averaged (mean), while Sepal.Width, Petal.Length, Petal.Width should be summed (summed).

I can do this by naming each variable as follows:

library(data.table)
iris <- data.table(iris)
iris[, list(Sepal.Length = mean(Sepal.Length), 
            Sepal.Width = sum(Sepal.Width),
            Petal.Length = sum(Petal.Length),
            Petal.Width = sum(Petal.Width)), by = c("Species")]

How can I generalize the procedure so that it's only necessary to name the exceptions (mean(Sepal.Length))?

As an alternative to naming only the exceptions (in the manner of -Sepal.Length), I could also exhaustively name all the indices; eg:

iris[, list(iris[,1] = mean(iris[,1]), 
            iris[,2:4] = sum(iris[,2:4])), by = c(Species)] 
# doesn't work of course

Upvotes: 0

Views: 66

Answers (2)

Merijn van Tilborg
Merijn van Tilborg

Reputation: 5887

In data.table you could do something like this, note that lapply over one column for the mean here is a bit silly, but this is a pretty universal way if you want to apply different functions over a set of columns.

iris[, lapply(.SD[, 1], mean), Species][iris[, lapply(.SD[, 2:4], sum), Species], on = "Species"]

#       Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1:     setosa        5.006       171.4         73.1        12.3
# 2: versicolor        5.936       138.5        213.0        66.3
# 3:  virginica        6.588       148.7        277.6       101.3

Upvotes: 1

Gregor Thomas
Gregor Thomas

Reputation: 145755

One of the great things about dplyr::across is that it doesn't have to be used in isolation.

your_data %>% 
  group_by(group1, group2) %>%
  summarize(across(-col_to_mean, sum), mean_col = mean(col_to_mean))

That works perfectly in my head, but if you want it tested please do provide a MWE.

Upvotes: 2

Related Questions