Reputation: 100
I have data with observations in rows. There is an outcome variable y (dbl) and several factors, here called f_1 and f_2, which denote conditions of an experiment. The following minimal example mirrors the data:
set.seed(123)
y = rnorm(10)
f_1 = factor(rep(c("A", "B"), 5))
f_2 = factor(rep(c("C", "D"), each = 5))
dat <- data.frame(y, f_1, f_2)
I would like to compute mean values of y for groups defined by f_1 and f_2. Importantly, I do not want a mean for each combination of f_1 and f_2, but means based on f_1 on the one hand and means based on f_2 on the other. These should be saved as factors in dat, so that each observation has a mean_f_1 (the mean when the data are grouped by f_1) and a mean_f_2 (the mean when the data are grouped by f_2). The labels of the new factors mean_f_1 and mean_f_2 should correspond to the labels of f_1 and f_2, because the labels have a meaning: a mean calculated for group "A" (from f_1) should keep the label "A" (in mean_f_1). The original data contain more than two condition variables f_..., so I would like to avoid repeating code for each factor (see I).
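To make the distinction concrete, here is a minimal base R sketch using the example data above. The names means_f_1, means_f_2 and cell_means are only illustrative and do not appear elsewhere in this post.

```r
set.seed(123)
dat <- data.frame(
  y   = rnorm(10),
  f_1 = factor(rep(c("A", "B"), 5)),
  f_2 = factor(rep(c("C", "D"), each = 5))
)

# Wanted: one mean per level of each factor, computed separately
means_f_1 <- tapply(dat$y, dat$f_1, mean)  # named "A", "B"
means_f_2 <- tapply(dat$y, dat$f_2, mean)  # named "C", "D"

# Not wanted: one mean per f_1 x f_2 combination (a 2 x 2 table)
cell_means <- tapply(dat$y, list(dat$f_1, dat$f_2), mean)
```

The first two results carry the factor labels as names, which is the link between mean and label that the new columns should preserve.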
I have come up with two approaches. The first (I; group_by approach) gives the desired result but repeats code for each factor.
I) group_by approach
library(dplyr)
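dat %>%
  group_by(f_1) %>%
  # one mean per f_1 group, stored as a factor labelled with the group's level
  mutate(mean_f_1 = factor(mean(y), labels = unique(f_1))) %>%
  group_by(f_2) %>%
  # same again for f_2
  mutate(mean_f_2 = factor(mean(y), labels = unique(f_2)))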
Repeating the group_by/mutate pair for each factor seems avoidable, but I did not manage to use across() here.
The other approach (II; ave approach) avoids code repetition but won't assign factor labels. Assigning the labels with unique() messed up the order of the labels relative to the original data.
II) ave approach
dat %>%
  mutate(across(starts_with("f"),
                ~ ave(y, .x, FUN = mean),
                .names = "mean_{.col}"))
Do you have an idea how to ...
A dplyr solution is preferred.
Upvotes: 2
Views: 80
Reputation: 6941
To avoid repeating code for each factor, I suggest iterating over factors. Something like:
library(dplyr)
factors = c("f_1", "f_2")
for (ff in factors) {
  # name of the new column, e.g. "mean_f_1"
  new_col = paste0("mean_", ff)
  dat <- dat %>%
    group_by(!!sym(ff)) %>%
    mutate(!!sym(new_col) := factor(mean(y), label = unique(!!sym(ff))))
}
This produces identical output to your group_by approach. To scale up to more columns, add them to the factors vector and the loop will iterate over them.
The !!sym(.) construct turns a character string into a column name. There are several other ways to do this; see the programming with dplyr vignette for other options. The unusual assignment operator := behaves like = except that it allows the name on the left-hand side to be computed (here, unquoted with !!).
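As a rough sketch of one of those other options, the same loop can be written with the .data pronoun and glue-style names on the left of :=. This assumes a reasonably recent dplyr (1.0 or later). The res variable and the trailing ungroup() are additions for illustration and are not part of the code above.

```r
library(dplyr)

set.seed(123)
dat <- data.frame(
  y   = rnorm(10),
  f_1 = factor(rep(c("A", "B"), 5)),
  f_2 = factor(rep(c("C", "D"), each = 5))
)

factors <- c("f_1", "f_2")
res <- dat  # work on a copy so dat stays unchanged
for (ff in factors) {
  res <- res %>%
    # .data[[ff]] looks up the column whose name is stored in ff
    group_by(.data[[ff]]) %>%
    # "mean_{ff}" builds the new column name from the string in ff
    mutate("mean_{ff}" := factor(mean(y), labels = unique(.data[[ff]]))) %>%
    # drop the grouping so later steps are not silently grouped
    ungroup()
}
```

Here res gains one factor column per entry in factors (mean_f_1, mean_f_2), each labelled with the level of the factor it was grouped by.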
Upvotes: 1