Reputation: 507
I didn't see another question quite like this on SO. Let's say I have a dataframe of consisting of 1...m = M variables I want to aggregate over, plus two other grouping variables I plan to aggregate by (m+2 columns total). One of the variables in M needs to be averaged, and the other m-1 should be summed.
I'm familiar with various ways to aggregate data in R, including performing different functions on separate columns. For example, I like this data.table
method:
df <- df[, list(w = sum(w), x = mean(x)), by = c("y", "z")]
which sums w and averages y by two groups, y,z.
After some research, it also looks like there's a dplyr
function across(-x, sum)
that can subject all variables meeting a condition to an aggregation function -- in this case, sum all variables other than x.
I'm having trouble figuring out how to use these together. In my case, I have > 1,000,000 rows and 200+ columns, so it's not feasible to cbind()
everything on the LHS of an aggregate()
-type function.
mean
, sum
) with an across()
technique? (My instinct is it might involve some merging.)data.table
solution? (Not necessary, but helpful for speed.)MWE: Consider the iris dataset.
library(datasets)
data(iris)
summary(iris)
which has four numeric columns (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) and one factor (Species).
I want to group by species, and within that group, aggregate each of the other four columns. However, Sepal.Length should be averaged (mean
), while Sepal.Width, Petal.Length, Petal.Width should be summed (summed
).
I can do this by naming each variable as follows:
library(data.table)
iris <- data.table(iris)
iris[, list(Sepal.Length = mean(Sepal.Length),
Sepal.Width = sum(Sepal.Width),
Petal.Length = sum(Petal.Length),
Petal.Width = sum(Petal.Width)), by = c("Species")]
How can I generalize the procedure so that it's only necessary to name the exceptions (mean(Sepal.Length)
)?
As an alternative to naming only the exceptions (in the manner of -Sepal.Length
), I could also exhaustively name all the indices; eg:
iris[, list(iris[,1] = mean(iris[,1]),
iris[,2:4] = sum(iris[,2:4])), by = c(Species)]
# doesn't work of course
Upvotes: 0
Views: 66
Reputation: 5887
In data.table
you could do something like this, note that lapply
over one column for the mean here is a bit silly, but this is a pretty universal way if you want to apply different functions over a set of columns.
iris[, lapply(.SD[, 1], mean), Species][iris[, lapply(.SD[, 2:4], sum), Species], on = "Species"]
# Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1: setosa 5.006 171.4 73.1 12.3
# 2: versicolor 5.936 138.5 213.0 66.3
# 3: virginica 6.588 148.7 277.6 101.3
Upvotes: 1
Reputation: 145755
One of the great things about dplyr::across
is that it doesn't have to be used in isolation.
your_data %>%
group_by(group1, group2) %>%
summarize(across(-col_to_mean, sum), mean_col = mean(col_to_mean))
That works perfectly in my head, but if you want it tested please do provide a MWE.
Upvotes: 2