Reputation: 7151
When I try to create several columns within summarize(), I cannot reference a newly created column name in the same summarize statement.
Example:
Goal: Calculate the standard error ("se") based on the standard deviation ("sd").
Step 1: Using sd as a placeholder for se works:
data %>%
  group_by(style) %>%
  summarise(across(score, list(mean = mean, sd = sd, se = sd)))
returns
  style score_mean score_sd score_se
* <fct>      <dbl>    <dbl>    <dbl>
1 S1           3.5    0.707    0.707
Replacing se = sd with the actual formula:
data %>%
  group_by(style) %>%
  summarise(across(score, list(mean = mean, sd = sd, se = sd/sqrt(nrow(score)))))
returns
Error: Problem with `summarise()` input `..1`.
x non-numeric argument to binary operator
ℹ Input `..1` is `across(score, list(mean = mean, sd = sd, se = sd/sqrt(nrow(data))))`.
ℹ The error occured in group 1: style = "S1".
I replaced score in nrow(score) with the other column names, or even used nrow(data), but they all led to the same error message.
I replaced the assignment for se, sd/sqrt(nrow(score)), with different variations, all leading to the same error. The simplest was sd/2, so even dividing by a constant doesn't work.
I replaced sd with score_sd to reference the new column created, as seen in the output (Step 1). Still the same error message.
The error message just refers to the whole across() statement, so it doesn't help to narrow down the root cause.
My hunch is that I have to reference the grouped data somehow, but I tried se = sd(.)/sqrt(nrow(data)) with no success.
Would be grateful for any hints...
Minimal reproducible example:
data <- structure(list(style = structure(c(1L, 2L, 3L, 4L, 5L, 1L, 2L,
3L, 4L, 5L), .Label = c("S1", "S2", "S3", "S4", "S5"), class = "factor"),
param = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L
), .Label = c("A", "B", "C"), class = "factor"), score = c(4,
1, 1, 3, 3, 3, 5, 1, 1, 1)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
Upvotes: 1
Views: 551
Reputation: 7151
After many trial-and-error attempts, I found the solution myself. This is for everyone who is not yet familiar with the across function, as dplyr 1.0.0 is not yet released.
So the answer to my question is:
1. You must reference the grouped data with . (inside across, it stands for the current column's values within the current group), BUT ONLY IF you use the purrr-style formula shorthand ~ !
2. However, you must NOT pass . to the n() function, as n() takes no arguments and always returns the size of the current group.
The second point took endless trials to find out, and is the reason why I wanted to share this solution.
It may not be intuitive either that, even though n() is called with parentheses, you are never allowed to pass . to it, as it always refers to the current group.
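As for why even sd/2 failed in the question, here is a minimal base R sketch. It assumes nothing beyond standard R evaluation: list() evaluates its arguments immediately, so sd/2 is computed before across() ever sees it, and sd at that point is the function itself, not a number. In an English locale the captured message should match the "non-numeric argument to binary operator" line from the question. The variable names msg and f are only for illustration.

```r
# sd is a function object here, so dividing it by a number fails on its own,
# completely independent of dplyr.
msg <- tryCatch(sd / 2, error = function(e) conditionMessage(e))
print(msg)

# A formula is not evaluated when it is created. It is stored as an
# unevaluated object, which across() can later turn into a function and
# call once per group.
f <- ~ sd(.) / 2
print(class(f))
```

That is why the ~ is needed as soon as the list entry is more than a bare function name such as mean or sd.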
This is what this double trick looks like:
data %>%
  group_by(style) %>%
  summarise(across(
    score,
    list(mean = mean, sd = sd, se = ~sd(.)/sqrt(n()))
  ))
If you know it, it's easy :-)
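For comparison, the same result can probably be reached without the formula shorthand by passing an ordinary function. The sketch below assumes dplyr >= 1.0.0 is installed. std_error is a hypothetical helper name, not part of dplyr, and length(x) stands in for n(), which should be equivalent here because each function receives exactly the current group's values. The data is rebuilt from the question's example with the param column left out for brevity.

```r
library(dplyr)  # assumes dplyr >= 1.0.0, where across() is available

# Same styles and scores as the question's example (param omitted).
data <- tibble(
  style = factor(rep(c("S1", "S2", "S3", "S4", "S5"), times = 2)),
  score = c(4, 1, 1, 3, 3, 3, 5, 1, 1, 1)
)

# Hypothetical helper: standard error of a numeric vector.
std_error <- function(x) sd(x) / sqrt(length(x))

result <- data %>%
  group_by(style) %>%
  summarise(across(score, list(mean = mean, sd = sd, se = std_error)))

# Cross-check against the formula version from the answer above.
check <- data %>%
  group_by(style) %>%
  summarise(across(score, list(se = ~ sd(.) / sqrt(n()))))

stopifnot(isTRUE(all.equal(result$score_se, check$score_se)))
print(result)
```

Which spelling to prefer is a matter of taste: the named function is reusable and avoids the . and n() subtleties, while the formula keeps everything in one line.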
Upvotes: 2