Agile Bean

Reputation: 7151

cannot reference grouped data in summarize(across(...))

When I try to create several columns within summarize(), I can reference a newly created column name in the same summarize statement.
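To make the contrast concrete, here is a minimal sketch of that plain summarise() behaviour, without across(). The small tibble `toy` is a hypothetical stand-in that carries the same style and score values as the reproducible example at the end of this question.

```r
library(dplyr)

# Hypothetical stand-in for the question's data (same style/score values)
toy <- tibble(
  style = factor(rep(c("S1", "S2", "S3", "S4", "S5"), times = 2)),
  score = c(4, 1, 1, 3, 3, 3, 5, 1, 1, 1)
)

plain <- toy %>%
  group_by(style) %>%
  summarise(
    mean = mean(score),
    sd   = sd(score),       # creates a summary column named sd
    se   = sd / sqrt(n())   # here sd resolves to the column just created
  )

print(plain)
```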

Example:

Goal: Try to calculate the standard error ("se") based on the standard deviation ("sd").

Step 1: as a placeholder, assign sd to se:

data %>% 
  group_by(style) %>% 
  summarise(across(score,list(mean = mean, sd = sd, se = sd)))

returns

  style score_mean score_sd score_se
* <fct>      <dbl>    <dbl>    <dbl>
1 S1           3.5    0.707    0.707

Step 2: calculate se based on sd

data %>% 
  group_by(style) %>% 
  summarise(across(score,list(mean = mean, sd = sd, se = sd/sqrt(nrow(score)))))

returns

Error: Problem with `summarise()` input `..1`.
x non-numeric argument to binary operator
ℹ Input `..1` is `across(score, list(mean = mean, sd = sd, se = sd/sqrt(nrow(data))))`.
ℹ The error occured in group 1: style = "S1".

Step 3: debugging the assignment term

3a) grouped data reference

I replaced score in nrow(score) with other column names, and even with nrow(data), but every variant led to the same error message.

3b) assignment operation

I replaced the expression assigned to se, sd/sqrt(nrow(score)), with several variations, all leading to the same error. The simplest was sd/2, so even dividing by a constant doesn't work.

3c) assignment reference

I replaced sd by score_sd to reference the new column created, as seen in the output (Step 1). Still the same error message.

Question: Why does Step 1 work but not Step 2?

The error message just refers to the whole across() statement, so doesn't help to narrow down the root cause.

My hunch is that I have to reference the grouped data somehow, but I tried se = sd(.)/sqrt(nrow(data)) with no success.

Would be grateful for any hints...

Minimal reproducible example:

data <- structure(list(style = structure(c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 
3L, 4L, 5L), .Label = c("S1", "S2", "S3", "S4", "S5"), class = "factor"), 
    param = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L
    ), .Label = c("A", "B", "C"), class = "factor"), score = c(4, 
    1, 1, 3, 3, 3, 5, 1, 1, 1)), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame"))

Upvotes: 1

Views: 551

Answers (1)

Agile Bean

Reputation: 7151

After much trial and error, I found the solution myself. This is for everyone who is not yet familiar with the across() function, as dplyr 1.0.0 is not yet released.

So the answer to my question is:

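  1. You must reference the grouped data with the . placeholder, BUT ONLY IF you use the purrr-style formula operator ~! Without the ~, an expression such as sd/sqrt(...) is evaluated immediately while the list is being built, so R tries to divide the function sd itself and fails with "non-numeric argument to binary operator".

  2. However, you must NOT pass the grouped data to n(): it takes no arguments, so it does NOT accept the . placeholder.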

The second point took endless trials to find out, and is the reason why I wanted to share this solution.

It may also seem unintuitive that, even though n() is written with parentheses, you can never pass . to it: it always returns the size of the current group.
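As far as I can tell, the original error can be reproduced in base R alone, which suggests it is not specific to across(). The sketch below divides the function object sd by a constant, catches the resulting error, and then shows that wrapping the same arithmetic in a function defers evaluation until a vector is supplied. The helper name `se_fun` is hypothetical, and the exact wording of the message depends on the R locale.

```r
# Dividing the function object `sd` by a number fails before dplyr is involved
msg <- tryCatch(sd / 2, error = function(e) conditionMessage(e))
print(msg)

# Wrapping the arithmetic in a function (which is what ~ does inside across())
# delays evaluation until the column values are actually passed in
se_fun <- function(x) sd(x) / sqrt(length(x))  # hypothetical helper
print(se_fun(c(4, 3)))
```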

This is what the double trick looks like:

data %>% 
  group_by(style) %>% 
  summarise(across(
    score, 
    list(mean = mean, sd = sd, se = ~sd(.)/sqrt(n()))
  ))
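For anyone who wants to check the result, here is a self-contained sketch under a few assumptions: `toy` is a hypothetical stand-in with the same style and score values as the question's data, and the lambda is spelled as an explicit function(x) instead of ~, which should behave the same way inside across(). It compares the se column against sd divided by the square root of the group size (2 rows per style in this data).

```r
library(dplyr)

# Hypothetical stand-in for the question's data (same style/score values)
toy <- tibble(
  style = factor(rep(c("S1", "S2", "S3", "S4", "S5"), times = 2)),
  score = c(4, 1, 1, 3, 3, 3, 5, 1, 1, 1)
)

res <- toy %>%
  group_by(style) %>%
  summarise(across(
    score,
    list(
      mean = mean,
      sd   = sd,
      # x is the current group's score vector; n() is called without arguments
      se   = function(x) sd(x) / sqrt(n())
    )
  ))

print(res)

# Each style has 2 rows, so se should equal sd / sqrt(2)
print(all.equal(res$score_se, res$score_sd / sqrt(2)))
```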

If you know it, it's easy :-)

Upvotes: 2
