How do you calculate mean values according to a factor while transferring the factor labels?

Question

I have data with observations in rows. There is an outcome variable y (dbl) as well as multiple factors, here called f_1 and f_2, which denote the conditions of an experiment. The following minimal example mirrors the data:

set.seed(123)

y = rnorm(10)
f_1 = factor(rep(c("A", "B"), 5))
f_2 = factor(rep(c("C", "D"), each = 5))

dat <- data.frame(y, f_1, f_2)

I would like to compute mean values of y for groups defined by f_1 and f_2. Importantly, I do not want a mean value for each combination of f_1 and f_2, but mean values based on f_1 on the one hand and mean values based on f_2 on the other. These should be saved as factors in dat, so that each observation has a mean_f_1 (the mean when the data is grouped by f_1) and a mean_f_2 (the mean when the data is grouped by f_2). The labels of the new factors mean_f_1 and mean_f_2 should correspond to the labels of f_1 and f_2, because the labels have a meaning: a mean calculated for group "A" (from f_1) should keep the label "A" (in mean_f_1). The original data has more than two condition variables f_..., so I would like to avoid repeating code for each factor (as approach I does).

I have come up with two approaches. The first (I; group_by approach) gives the desired result, but repeats code for each factor.

I) group_by approach

library(dplyr)

dat %>% 
  
  group_by(f_1) %>% 
  mutate(mean_f_1 = factor(mean(y), label = unique(f_1))) %>% 
  
  group_by(f_2) %>% 
  mutate(mean_f_2 = factor(mean(y), label = unique(f_2)))

Repeating the 'group_by - mutate' statements for each factor seems avoidable, but I did not manage to use across() here.

The other approach (II; ave approach) avoids code repetition, but won't assign factor labels. Assigning factor labels using unique() messed up the order of the labels relative to the original data.

II) ave approach

dat %>% mutate(across(starts_with("f"), 
                      ~ ave(y, .x, FUN = mean),
                      .names = "mean_{.col}"))

Do you have an idea how to ...

... improve (I) to work on multiple factors?
... improve (II) to include factor labels?
... solve the problem differently?

A dplyr solution is preferred.

Simon.S.A. · Accepted Answer

To avoid repeating code for each factor, I suggest iterating over factors. Something like:

library(dplyr)

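# names of the factor columns to group by, one at a time
factors = c("f_1", "f_2")

for(ff in factors){

  # name of the new column, e.g. "mean_f_1"
  new_col = paste0("mean_",ff)

  # group by the current factor and label the group mean with the group's level
  dat <- dat %>% 
    group_by(!!sym(ff)) %>% 
    mutate(!!sym(new_col) := factor(mean(y), labels = unique(!!sym(ff))))
}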

This produces identical output to your group_by approach. To scale up to more columns, add their names to the factors vector and the code will iterate over them.
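
As a rough check of the claim above, the sketch below runs the loop on the example data from the question and compares it with approach (I). Two things here are assumptions rather than part of the original answer: the factor names are collected with grep() on the assumption that every condition column starts with "f_", and the result is stored in a new object (here called looped) instead of overwriting dat.

```r
library(dplyr)

# Example data from the question
set.seed(123)
dat <- data.frame(
  y   = rnorm(10),
  f_1 = factor(rep(c("A", "B"), 5)),
  f_2 = factor(rep(c("C", "D"), each = 5))
)

# Assumption: every condition column is named f_<something>
factors <- grep("^f_", names(dat), value = TRUE)

# The loop from the answer, writing to a copy of the data
looped <- dat
for (ff in factors) {
  new_col <- paste0("mean_", ff)
  looped <- looped %>%
    group_by(!!sym(ff)) %>%
    mutate(!!sym(new_col) := factor(mean(y), labels = unique(!!sym(ff))))
}

# Approach (I) from the question, written out by hand
manual <- dat %>%
  group_by(f_1) %>%
  mutate(mean_f_1 = factor(mean(y), labels = unique(f_1))) %>%
  group_by(f_2) %>%
  mutate(mean_f_2 = factor(mean(y), labels = unique(f_2)))

# Compare the two results as plain data frames
all.equal(as.data.frame(looped), as.data.frame(manual))
```

Collecting the names programmatically means new condition columns are picked up without editing the vector, as long as the naming assumption holds.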

The !!sym(.) construct turns a character string into a column reference. There are several other ways to do this; see the "Programming with dplyr" vignette for other options. The unusual assignment operator := behaves like =, except that it allows the name on the left-hand side to be computed (for example with !!) rather than typed literally.
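
For illustration, here is a minimal sketch of one of those other spellings, assuming a reasonably recent dplyr and rlang. It swaps in across(all_of()) for the grouping, the .data pronoun for the column lookup, and a glue-style string on the left of := for the new name. The object name res and the trailing ungroup() are choices made for this sketch, not part of the answer above.

```r
library(dplyr)

# Example data from the question
set.seed(123)
dat <- data.frame(
  y   = rnorm(10),
  f_1 = factor(rep(c("A", "B"), 5)),
  f_2 = factor(rep(c("C", "D"), each = 5))
)

factors <- c("f_1", "f_2")
res <- dat  # hypothetical name; keeps the original dat untouched

for (ff in factors) {
  res <- res %>%
    # all_of() selects the column whose name is stored in ff
    group_by(across(all_of(ff))) %>%
    # "mean_{ff}" is expanded to e.g. "mean_f_1"; .data[[ff]] looks up the column by name
    mutate("mean_{ff}" := factor(mean(y), labels = unique(.data[[ff]]))) %>%
    # drop the grouping so the result is not left grouped by the last factor
    ungroup()
}

names(res)
```

Unlike the loop above, this version returns an ungrouped data frame, which may or may not be what you want downstream.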
