Reputation: 623
I'm trying to summarize multiple columns using summarize_at() with a custom function. The part I'm stuck on is that the function ssmd() is meant to take one vector of values from the group established by group_by() and another vector of values from outside this group.
In the example below, x should be the vector of values for the current Month (it varies with the group), and y should be a fixed set of values for Month == 5.
# custom function
ssmd <- function(x, y) {
  (mean(x, na.rm = TRUE) - mean(y, na.rm = TRUE)) /
    sqrt(var(x, na.rm = TRUE) + var(y, na.rm = TRUE))
}
# dataset
d <- airquality
# this isn't working - trying to find the difference between the mean for each Month and the mean of Month 5, for columns Ozone, Solar.R, Wind, and Temp
d %>%
  group_by(Month) %>%
  summarize_at(vars(Ozone:Temp), funs(ssmd, x = ., y = .[Month == 5])) %>%
  ungroup()
At the moment, this gives the following error: Error in mean(y, na.rm = TRUE) : argument "y" is missing, with no default. So I think I have a syntax error, in addition to being stuck on how to access values from outside the current group.
The expected output is a data frame with one row for each Month and one column for each variable (Ozone, Solar.R, Wind, and Temp).
Upvotes: 0
Views: 326
Reputation: 389235
There are two issues:
1) When you refer to Month inside funs, it covers only the current group, not the entire data frame.
2) Issue 1 can be worked around using .$Month, but summarize_at does not give you access to the entire column, so you cannot subset only those values where Month == 5.
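The first issue can be seen with a small check. This is only an illustrative sketch (the names may_rows_seen and n_may are made up for the example): it counts, within each group, how many rows satisfy Month == 5.

```r
library(dplyr)

d <- airquality

# Inside a grouped summarise, a bare column name such as Month
# refers only to the rows of the current group.
may_rows_seen <- d %>%
  group_by(Month) %>%
  summarise(n_may = sum(Month == 5))

may_rows_seen
```

Only the May group sees any rows with Month == 5, so a subset like .[Month == 5] is empty in every other group.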
However, you don't need that custom function. You can take the mean of all columns for each Month and then subtract, from each column, its value where Month == 5.
library(dplyr)
d %>%
  group_by(Month) %>%
  summarize_at(vars(Ozone:Temp), mean, na.rm = TRUE) %>%
  mutate_at(vars(Ozone:Temp), ~. - .[Month == 5])
# A tibble: 5 x 5
# Month Ozone Solar.R Wind Temp
# <int> <dbl> <dbl> <dbl> <dbl>
#1 5 0 0 0 0
#2 6 5.83 8.87 -1.36 13.6
#3 7 35.5 35.2 -2.68 18.4
#4 8 36.3 -9.44 -2.83 18.4
#5 9 7.83 -13.9 -1.44 11.4
To use the ssmd function from the updated post, we can do:
library(dplyr)
library(purrr)
named_info <- d %>% select(Ozone:Temp) %>% names()
map(named_info, function(x) d %>% group_by(Month) %>%
      summarise_at(vars(x), ~ssmd(., d[[x]][d$Month == 5]))) %>%
  reduce(inner_join, by = 'Month')
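As a possible alternative, assuming dplyr 1.0.0 or later is available, the same idea can be sketched in a single pipeline with across() and cur_column(). This is not part of the original approach, and res is just an illustrative name. The reference values for May are pulled from the ungrouped data frame d, so they stay fixed across groups.

```r
library(dplyr)

# custom function from the question
ssmd <- function(x, y) {
  (mean(x, na.rm = TRUE) - mean(y, na.rm = TRUE)) /
    sqrt(var(x, na.rm = TRUE) + var(y, na.rm = TRUE))
}

d <- airquality

res <- d %>%
  group_by(Month) %>%
  # .x is the current group's values for the column;
  # cur_column() gives that column's name, used to look up
  # the Month == 5 values in the full, ungrouped data frame d
  summarise(across(Ozone:Temp,
                   ~ ssmd(.x, d[[cur_column()]][d$Month == 5])))

res
```

The result has one row per Month and one column per variable, and the row for Month 5 is all zeros because x and y are the same vector there.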
Upvotes: 1
Reputation: 39174
I don't know how to fix your syntax error, but here is a workaround. It summarizes the data as the monthly mean for each column and then subtracts the first value, which is the mean for May.
library(dplyr)
d <- airquality
d1 <- d %>%
  group_by(Month) %>%
  summarize_at(vars(Ozone:Temp), list(~mean(., na.rm = TRUE))) %>%
  ungroup()
d1[-1] <- lapply(d1[-1], function(x) x - x[1])
d1
# # A tibble: 5 x 5
# Month Ozone Solar.R Wind Temp
# <int> <dbl> <dbl> <dbl> <dbl>
# 1 5 0 0 0 0
# 2 6 5.83 8.87 -1.36 13.6
# 3 7 35.5 35.2 -2.68 18.4
# 4 8 36.3 -9.44 -2.83 18.4
# 5 9 7.83 -13.9 -1.44 11.4
Upvotes: 1