Reputation: 7986
From a data frame, is there a easy way to aggregate (sum
, mean
, max
etc) multiple variables simultaneously?
Below are some sample data:
library(lubridate)
days = 365*2
date = seq(as.Date("2000-01-01"), length = days, by = "day")
year = year(date)
month = month(date)
x1 = cumsum(rnorm(days, 0.05))
x2 = cumsum(rnorm(days, 0.05))
df1 = data.frame(date, year, month, x1, x2)
I would like to simultaneously aggregate the x1
and x2
variables from the df2
data frame by year and month. The following code aggregates the x1
variable, but is it also possible to simultaneously aggregate the x2
variable?
### aggregate variables by year month
df2=aggregate(x1 ~ year+month, data=df1, sum, na.rm=TRUE)
head(df2)
Upvotes: 206
Views: 234788
Reputation: 79
Late to the party, but recently found another way to get the summary statistics.
library(psych)
describe(data)
describe.by(column, group = grouped_column)
Will output: mean, min, max, standard deviation, n, standard error, kurtosis, skewness, median, and range for each variable.
Upvotes: -1
Reputation: 83275
With the dplyr package, you can use across()
to aggregate multiple variables using tidyselect language. For the example dataset you can do this as follows:
library(dplyr)
set.seed(13)
# summarising all non-grouping variables
df1 %>% group_by(year, month) %>% summarise(across(everything(), n_distinct))
# summarising a specific set of non-grouping variables
df1 %>% group_by(year, month) %>% summarise(across(x1:x2, sum))
df1 %>% group_by(year, month) %>% summarise(across(c(x1, x2), sum))
df1 %>% group_by(year, month) %>% summarise(across(-date, sum))
# summarising a specific set of non-grouping variables using selection helpers:
df1 %>% group_by(year, month) %>% summarise(across(starts_with('x'), sum))
df1 %>% group_by(year, month) %>% summarise(across(matches('.*[0-9]'), sum))
# summarising a specific set of non-grouping variables based on condition (class)
df1 %>% group_by(year, month) %>% summarise(across(where(is.numeric), sum))
All but the first of these result in:
# A tibble: 24 × 4
# Groups: year [2]
year month x1 x2
<dbl> <dbl> <dbl> <dbl>
1 2000 1 131. 27.4
2 2000 2 44.8 155.
3 2000 3 60.7 207.
4 2000 4 -11.5 379.
5 2000 5 64.0 441.
6 2000 6 -16.5 517.
7 2000 7 210. 530.
8 2000 8 112. 573.
9 2000 9 -129. 347.
10 2000 10 -165. 444.
# … with 14 more rows
You can also apply multiple functions to the selected columns:
df1 %>%
group_by(year, month) %>%
summarise(across(x1:x2, list(sum = sum, avg = mean)))
# A tibble: 24 × 6
# Groups: year [2]
year month x1_sum x1_avg x2_sum x2_avg
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2000 1 131. 4.24 27.4 0.884
2 2000 2 44.8 1.54 155. 5.34
3 2000 3 60.7 1.96 207. 6.69
4 2000 4 -11.5 -0.385 379. 12.6
5 2000 5 64.0 2.06 441. 14.2
6 2000 6 -16.5 -0.550 517. 17.2
7 2000 7 210. 6.76 530. 17.1
8 2000 8 112. 3.60 573. 18.5
9 2000 9 -129. -4.30 347. 11.6
10 2000 10 -165. -5.33 444. 14.3
# … with 14 more rows
A few final notes:
summarise()
drops the last level of grouping, so all the examples above would still be grouped by year
. To drop all grouping, you can add an ungroup()
call, or set .groups = "drop"
in the summarise()
call..by
argument, e.g., df1 %>% summarise(across(c(x1, x2), sum), .by = c(year, month))
across()
also works with other dplyr verbs such as mutate()
and reframe()
.across()
with dplyr 1.0.0, these kinds of operations were done with summarise_all()
, summarise_at()
, summarise_if()
, and (even earlier) by summarise_each()
. These are now superseded or deprecated in favor of across()
.Upvotes: 70
Reputation: 952
Using the data.table
package, which is fast (useful for larger datasets)
https://github.com/Rdatatable/data.table/wiki
library(data.table)
df2 <- setDT(df1)[, lapply(.SD, sum), by = .(year, month), .SDcols = c("x1","x2")]
setDF(df2) # convert back to dataframe
Using the plyr package
require(plyr)
df2 <- ddply(df1, c("year", "month"), function(x) colSums(x[c("x1", "x2")]))
Using summarize() from the Hmisc package (column headings are messy in my example though)
# need to detach plyr because plyr and Hmisc both have a summarize()
detach(package:plyr)
require(Hmisc)
df2 <- with(df1, summarize( cbind(x1, x2), by=llist(year, month), FUN=colSums))
Upvotes: 59
Reputation: 52399
An updated dplyr
solution: since dplyr 1.1.0
, you can use .by
in summarise
to do an inline temporary grouping (which automatically ungroup
s after the computation).
Using across
(available from dplyr 1.0.0
) allows to use the same function for multiple columns at the same time.
library(dplyr)
df1 %>%
summarise(across(starts_with('x'), sum), .by = c(year, month))
# A tibble: 24 x 4
# year month x1 x2
# <dbl> <dbl> <dbl> <dbl>
# 1 2000 1 11.7 52.9
# 2 2000 2 -74.1 126.
# 3 2000 3 -132. 149.
# 4 2000 4 -130. 4.12
# 5 2000 5 -91.6 -55.9
# 6 2000 6 179. 73.7
# 7 2000 7 95.0 409.
# 8 2000 8 255. 283.
# 9 2000 9 489. 331.
#10 2000 10 719. 305.
# … with 14 more rows
Upvotes: 2
Reputation: 5702
Below is another way to summarize multiple columns, especially useful when the function needs further arguments. You can select all columns via everything()
or a subset of columns like any_of(c("a", "b"))
.
library(dplyr)
# toy data
df <- tibble(a = sample(c(NA, 5:7), 30, replace = TRUE),
b = sample(c(NA, 1:5), 30, replace = TRUE),
c = sample(1:5, 30, replace = TRUE),
grp = sample(1:3, 30, replace = TRUE))
df
#> # A tibble: 30 × 4
#> a b c grp
#> <int> <int> <int> <int>
#> 1 7 1 3 1
#> 2 7 4 4 2
#> 3 5 1 3 3
#> 4 7 NA 3 2
#> 5 7 2 5 2
#> 6 7 4 4 2
#> 7 7 NA 3 3
#> 8 NA 5 4 1
#> 9 5 1 1 2
#> 10 NA 3 1 2
#> # … with 20 more rows
df %>%
group_by(grp) %>%
summarise(across(everything(),
list(mean = ~mean(., na.rm = TRUE),
q75 = ~quantile(., probs = .75, na.rm = TRUE))))
#> # A tibble: 3 × 7
#> grp a_mean a_q75 b_mean b_q75 c_mean c_q75
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 6.6 7 2.88 4.25 3 4
#> 2 2 6.33 7 2.62 3.25 2.9 4
#> 3 3 5.78 6 3.33 4 3.09 4
Upvotes: 1
Reputation: 887951
With the dplyr
version >= 1.0.0
, we can also use summarise
to apply function on multiple columns with across
library(dplyr)
df1 %>%
group_by(year, month) %>%
summarise(across(starts_with('x'), sum))
# A tibble: 24 x 4
# Groups: year [2]
# year month x1 x2
# <dbl> <dbl> <dbl> <dbl>
# 1 2000 1 11.7 52.9
# 2 2000 2 -74.1 126.
# 3 2000 3 -132. 149.
# 4 2000 4 -130. 4.12
# 5 2000 5 -91.6 -55.9
# 6 2000 6 179. 73.7
# 7 2000 7 95.0 409.
# 8 2000 8 255. 283.
# 9 2000 9 489. 331.
#10 2000 10 719. 305.
# … with 14 more rows
Upvotes: 8
Reputation: 1369
For a more flexible and faster approach to data aggregation, check out the collap
function in the collapse R package available on CRAN:
library(collapse)
# Simple aggregation with one function
head(collap(df1, x1 + x2 ~ year + month, fmean))
year month x1 x2
1 2000 1 -1.217984 4.008534
2 2000 2 -1.117777 11.460301
3 2000 3 5.552706 8.621904
4 2000 4 4.238889 22.382953
5 2000 5 3.124566 39.982799
6 2000 6 -1.415203 48.252283
# Customized: Aggregate columns with different functions
head(collap(df1, x1 + x2 ~ year + month,
custom = list(fmean = c("x1", "x2"), fmedian = "x2")))
year month fmean.x1 fmean.x2 fmedian.x2
1 2000 1 -1.217984 4.008534 3.266968
2 2000 2 -1.117777 11.460301 11.563387
3 2000 3 5.552706 8.621904 8.506329
4 2000 4 4.238889 22.382953 20.796205
5 2000 5 3.124566 39.982799 39.919145
6 2000 6 -1.415203 48.252283 48.653926
# You can also apply multiple functions to all columns
head(collap(df1, x1 + x2 ~ year + month, list(fmean, fmin, fmax)))
year month fmean.x1 fmin.x1 fmax.x1 fmean.x2 fmin.x2 fmax.x2
1 2000 1 -1.217984 -4.2460775 1.245649 4.008534 -1.720181 10.47825
2 2000 2 -1.117777 -5.0081858 3.330872 11.460301 9.111287 13.86184
3 2000 3 5.552706 0.1193369 9.464760 8.621904 6.807443 11.54485
4 2000 4 4.238889 0.8723805 8.627637 22.382953 11.515753 31.66365
5 2000 5 3.124566 -1.5985090 7.341478 39.982799 31.957653 46.13732
6 2000 6 -1.415203 -4.6072295 2.655084 48.252283 42.809211 52.31309
# When you do that, you can also return the data in a long format
head(collap(df1, x1 + x2 ~ year + month, list(fmean, fmin, fmax), return = "long"))
Function year month x1 x2
1 fmean 2000 1 -1.217984 4.008534
2 fmean 2000 2 -1.117777 11.460301
3 fmean 2000 3 5.552706 8.621904
4 fmean 2000 4 4.238889 22.382953
5 fmean 2000 5 3.124566 39.982799
6 fmean 2000 6 -1.415203 48.252283
Note: You can use base functions like mean, max
etc. with collap
, but fmean, fmax
etc. are C++ based grouped functions offered in the collapse package which are significantly faster (i.e. the performance on large data aggregations is the same as data.table while providing greater flexibility, and these fast grouped functions can also be used without collap
).
Note2: collap
also supports flexible multitype data aggregation, which you can of course do using the custom
argument, but you can also apply functions to numeric and non-numeric columns in a semi-automated way:
# wlddev is a data set of World Bank Indicators provided in the collapse package
head(wlddev)
country iso3c date year decade region income OECD PCGDP LIFEEX GINI ODA
1 Afghanistan AFG 1961-01-01 1960 1960 South Asia Low income FALSE NA 32.292 NA 114440000
2 Afghanistan AFG 1962-01-01 1961 1960 South Asia Low income FALSE NA 32.742 NA 233350000
3 Afghanistan AFG 1963-01-01 1962 1960 South Asia Low income FALSE NA 33.185 NA 114880000
4 Afghanistan AFG 1964-01-01 1963 1960 South Asia Low income FALSE NA 33.624 NA 236450000
5 Afghanistan AFG 1965-01-01 1964 1960 South Asia Low income FALSE NA 34.060 NA 302480000
6 Afghanistan AFG 1966-01-01 1965 1960 South Asia Low income FALSE NA 34.495 NA 370250000
# This aggregates the data, applying the mean to numeric and the statistical mode to categorical columns
head(collap(wlddev, ~ iso3c + decade, FUN = fmean, catFUN = fmode))
country iso3c date year decade region income OECD PCGDP LIFEEX GINI ODA
1 Aruba ABW 1961-01-01 1962.5 1960 Latin America & Caribbean High income FALSE NA 66.58583 NA NA
2 Aruba ABW 1967-01-01 1970.0 1970 Latin America & Caribbean High income FALSE NA 69.14178 NA NA
3 Aruba ABW 1976-01-01 1980.0 1980 Latin America & Caribbean High income FALSE NA 72.17600 NA 33630000
4 Aruba ABW 1987-01-01 1990.0 1990 Latin America & Caribbean High income FALSE 23677.09 73.45356 NA 41563333
5 Aruba ABW 1996-01-01 2000.0 2000 Latin America & Caribbean High income FALSE 26766.93 73.85773 NA 19857000
6 Aruba ABW 2007-01-01 2010.0 2010 Latin America & Caribbean High income FALSE 25238.80 75.01078 NA NA
# Note that by default (argument keep.col.order = TRUE) the column order is also preserved
Upvotes: 4
Reputation: 2747
Interestingly, base R aggregate
's data.frame
method is not showcased here, above the formula interface is used, so for completeness:
aggregate(
x = df1[c("x1", "x2")],
by = df1[c("year", "month")],
FUN = sum, na.rm = TRUE
)
More generic use of aggregate's data.frame method:
Since we are providing a
data.frame
as x
and list
(data.frame
is also a list
) as by
, this is very useful if we need to use it in a dynamic manner, e.g. using other columns to be aggregated and to aggregate by is very simpleFor example like so:
colsToAggregate <- c("x1")
aggregateBy <- c("year", "month")
dummyaggfun <- function(v, na.rm = TRUE) {
c(sum = sum(v, na.rm = na.rm), mean = mean(v, na.rm = na.rm))
}
aggregate(df1[colsToAggregate], by = df1[aggregateBy], FUN = dummyaggfun)
Upvotes: 25
Reputation: 13310
Where is this year()
function from?
You could also use the reshape2
package for this task:
require(reshape2)
df_melt <- melt(df1, id = c("date", "year", "month"))
dcast(df_melt, year + month ~ variable, sum)
# year month x1 x2
1 2000 1 -80.83405 -224.9540159
2 2000 2 -223.76331 -288.2418017
3 2000 3 -188.83930 -481.5601913
4 2000 4 -197.47797 -473.7137420
5 2000 5 -259.07928 -372.4563522
Upvotes: 48
Reputation: 179578
Yes, in your formula
, you can cbind
the numeric variables to be aggregated:
aggregate(cbind(x1, x2) ~ year + month, data = df1, sum, na.rm = TRUE)
year month x1 x2
1 2000 1 7.862002 -7.469298
2 2001 1 276.758209 474.384252
3 2000 2 13.122369 -128.122613
...
23 2000 12 63.436507 449.794454
24 2001 12 999.472226 922.726589
See ?aggregate
, the formula
argument and the examples.
Upvotes: 215