Reputation: 385
I'm looking for a way to make a summary table that compares values each level of a factor in R to all other levels for that factor. An example with the iris dataset - I would want to compare setosa to all others (i.e., versicolor and virginica), then versicolor to others (setosa and virginica), and finally virignica to others (versicolor and setosa). In my actual dataset I have many groups, so I don't want to hard code each level. I'm looking for a tidyverse solution if possible. My desired outcome would be a summarise table that looks like this:
Here - the 'yes' in 'in group' is the Species that are in that group (so for setosa, it would be setosa only) and the 'no' is the Species that are not in that group (so for Setosa - no it would be versicolor and virgnicia combined).
Upvotes: 1
Views: 275
Reputation: 887118
1) We can do this within dplyr
itself. Grouped by 'Species', summarise
the column we need i.e 'Sepal.length' by concatenating the mean
of the column and the subset of the full data column by making use of cur_group_id()
(No other packages except dplyr
is used)
library(dplyr)
iris %>%
group_by(Species) %>%
summarise(InGroup = c('Yes', 'No'), MeanSepalLength = c(mean(Sepal.Length),
mean(.$Sepal.Length[as.numeric(.$Species) != cur_group_id()])),
.groups = 'drop')
# A tibble: 6 x 3
# Species InGroup MeanSepalLength
# <fct> <chr> <dbl>
#1 setosa Yes 5.01
#2 setosa No 6.26
#3 versicolor Yes 5.94
#4 versicolor No 5.80
#5 virginica Yes 6.59
#6 virginica No 5.47
2) If we want to do this in multiple columns, use across
iris %>%
group_by(Species) %>%
summarise(InGroup = c('Yes', 'No'),
across(where(is.numeric), ~ c(mean(.),
mean(iris[[cur_column()]][
as.numeric(iris$Species) != cur_group_id()])), .names = 'Mean{.col}'),
.groups = 'drop')
-output
# A tibble: 6 x 6
# Species InGroup MeanSepal.Length MeanSepal.Width MeanPetal.Length MeanPetal.Width
# <fct> <chr> <dbl> <dbl> <dbl> <dbl>
#1 setosa Yes 5.01 3.43 1.46 0.246
#2 setosa No 6.26 2.87 4.91 1.68
#3 versicolor Yes 5.94 2.77 4.26 1.33
#4 versicolor No 5.80 3.20 3.51 1.14
#5 virginica Yes 6.59 2.97 5.55 2.03
#6 virginica No 5.47 3.10 2.86 0.786
3) If we need a function, that can be created as well
f1 <- function(dat, grp) {
grp_str <- rlang::as_string(rlang::ensym(grp))
dat %>%
group_by({{grp}}) %>%
summarise(InGroup = c('Yes', 'No'),
across(where(is.numeric), ~ c(mean(.),
mean(dat[[cur_column()]][
as.numeric(dat[[grp_str]]) != cur_group_id()])),
.names = 'Mean{.col}'), .groups = 'drop')
}
-testing
f1(iris, Species)
# A tibble: 6 x 6
# Species InGroup MeanSepal.Length MeanSepal.Width MeanPetal.Length MeanPetal.Width
# <fct> <chr> <dbl> <dbl> <dbl> <dbl>
#1 setosa Yes 5.01 3.43 1.46 0.246
#2 setosa No 6.26 2.87 4.91 1.68
#3 versicolor Yes 5.94 2.77 4.26 1.33
#4 versicolor No 5.80 3.20 3.51 1.14
#5 virginica Yes 6.59 2.97 5.55 2.03
#6 virginica No 5.47 3.10 2.86 0.786
or with diamonds
f1(diamonds, cut)
# A tibble: 10 x 9
# cut InGroup Meancarat Meandepth Meantable Meanprice Meanx Meany Meanz
# <ord> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 Fair Yes 1.05 64.0 59.1 4359. 6.25 6.18 3.98
# 2 Fair No 0.790 61.7 57.4 3920. 5.72 5.72 3.53
# 3 Good Yes 0.849 62.4 58.7 3929. 5.84 5.85 3.64
# 4 Good No 0.793 61.7 57.3 3933. 5.72 5.72 3.53
# 5 Very Good Yes 0.806 61.8 58.0 3982. 5.74 5.77 3.56
# 6 Very Good No 0.796 61.7 57.3 3919. 5.73 5.72 3.53
# 7 Premium Yes 0.892 61.3 58.7 4584. 5.97 5.94 3.65
# 8 Premium No 0.766 61.9 57.0 3709. 5.65 5.66 3.50
# 9 Ideal Yes 0.703 61.7 56.0 3458. 5.51 5.52 3.40
#10 Ideal No 0.861 61.8 58.5 4249. 5.88 5.88 3.63
4) Or another option would be to take the difference in sum
and divide by the difference in number of rows
iris %>%
group_by(Species) %>%
summarise(InGroup = c('Yes', 'No'), across(where(is.numeric),
~ c(mean(.), (sum(iris[[cur_column()]]) -
sum(.))/(nrow(iris) - n())), .names = 'Mean{.col}'), .groups = 'drop')
Upvotes: 0
Reputation: 93811
In the code below, we use map
to operate separately on each level of Species
. For each iteration, we create a grouping column in.group
marking whether a row is or is not a member of the given species. We then return means by group for all numeric columns:
library(tidyverse)
unique(as.character(iris$Species)) %>%
set_names() %>%
map_df(
~iris %>%
group_by(in.group = Species==.x) %>%
summarise(across(where(is.numeric), mean, .names="mean_{col}")),
.id="Species"
)
#> # A tibble: 6 x 6
#> Species in.group mean_Sepal.Length mean_Sepal.Width mean_Petal.Length
#> <chr> <lgl> <dbl> <dbl> <dbl>
#> 1 setosa FALSE 6.26 2.87 4.91
#> 2 setosa TRUE 5.01 3.43 1.46
#> 3 versicolor FALSE 5.80 3.20 3.51
#> 4 versicolor TRUE 5.94 2.77 4.26
#> 5 virginica FALSE 5.47 3.10 2.86
#> 6 virginica TRUE 6.59 2.97 5.55
#> # … with 1 more variable: mean_Petal.Width <dbl>
You can also add the following onto the chain to make the output a bit more economical:
mutate(Species = case_when(in.group ~ Species,
!in.group ~ paste("not", Species))) %>%
select(-in.group)
Which gives:
Species mean_Sepal.Length mean_Sepal.Width mean_Petal.Length mean_Petal.Width
1 not setosa 6.26 2.87 4.91 1.68
2 setosa 5.01 3.43 1.46 0.246
3 not versicolor 5.80 3.20 3.51 1.14
4 versicolor 5.94 2.77 4.26 1.33
5 not virginica 5.47 3.10 2.86 0.786
6 virginica 6.59 2.97 5.55 2.03
You can package this as a function:
compare.groups = function(data, group) {
group = ensym(group)
# Get levels of group
x = data %>%
distinct(!!group) %>%
pull(!!group) %>%
as.character %>%
set_names()
# Map over each level
x %>%
map_df(
~ data %>%
group_by(in.group = !!group == .x) %>%
summarise(across(where(is.numeric), mean, .names="mean_{col}")),
.id=as_label(enquo(group))
) %>%
mutate(!!group := case_when(in.group ~ !!group,
!in.group ~ paste("not", !!group))) %>%
select(-in.group)
}
# Run the function on a couple of data frames
compare.groups(iris, Species)
compare.groups(diamonds, cut)
You can also use the function to get results for all categorical columns in your data frame:
diamonds %>%
select(where(~!is.numeric(.))) %>%
names() %>%
set_names() %>%
map_df(
~compare.groups(diamonds, !!.x) %>%
rename(category = .x),
.id="variable"
)
variable category mean_carat mean_depth mean_table mean_price mean_x mean_y mean_z 1 cut not Ideal 0.861 61.8 58.5 4249. 5.88 5.88 3.63 2 cut Ideal 0.703 61.7 56.0 3458. 5.51 5.52 3.40 3 cut not Premium 0.766 61.9 57.0 3709. 5.65 5.66 3.50 4 cut Premium 0.892 61.3 58.7 4584. 5.97 5.94 3.65 5 cut not Good 0.793 61.7 57.3 3933. 5.72 5.72 3.53 6 cut Good 0.849 62.4 58.7 3929. 5.84 5.85 3.64 7 cut not Very Good 0.796 61.7 57.3 3919. 5.73 5.72 3.53 8 cut Very Good 0.806 61.8 58.0 3982. 5.74 5.77 3.56 9 cut not Fair 0.790 61.7 57.4 3920. 5.72 5.72 3.53 10 cut Fair 1.05 64.0 59.1 4359. 6.25 6.18 3.98 11 color not E 0.829 61.8 57.4 4123. 5.80 5.80 3.58 12 color E 0.658 61.7 57.5 3077. 5.41 5.42 3.34 13 color not I 0.772 61.7 57.4 3803. 5.68 5.68 3.50 14 color I 1.03 61.8 57.6 5092. 6.22 6.22 3.85 15 color not J 0.778 61.7 57.4 3856. 5.69 5.69 3.51 16 color J 1.16 61.9 57.8 5324. 6.52 6.52 4.03 17 color not H 0.777 61.7 57.4 3832. 5.69 5.69 3.51 18 color H 0.912 61.8 57.5 4487. 5.98 5.98 3.70 19 color not F 0.811 61.8 57.5 3977. 5.76 5.76 3.55 20 color F 0.737 61.7 57.4 3725. 5.61 5.62 3.46 # … with 20 more rows
Upvotes: 1