Marissa
Marissa

Reputation: 385

dplyr summarise based on whether in group or not

I'm looking for a way to make a summary table that compares values each level of a factor in R to all other levels for that factor. An example with the iris dataset - I would want to compare setosa to all others (i.e., versicolor and virginica), then versicolor to others (setosa and virginica), and finally virignica to others (versicolor and setosa). In my actual dataset I have many groups, so I don't want to hard code each level. I'm looking for a tidyverse solution if possible. My desired outcome would be a summarise table that looks like this:

enter image description here

Here - the 'yes' in 'in group' is the Species that are in that group (so for setosa, it would be setosa only) and the 'no' is the Species that are not in that group (so for Setosa - no it would be versicolor and virgnicia combined).

Upvotes: 1

Views: 275

Answers (2)

akrun
akrun

Reputation: 887118

1) We can do this within dplyr itself. Grouped by 'Species', summarise the column we need i.e 'Sepal.length' by concatenating the mean of the column and the subset of the full data column by making use of cur_group_id() (No other packages except dplyr is used)

library(dplyr)
iris %>% 
    group_by(Species) %>% 
    summarise(InGroup = c('Yes', 'No'), MeanSepalLength = c(mean(Sepal.Length),
       mean(.$Sepal.Length[as.numeric(.$Species) != cur_group_id()])),
          .groups = 'drop')
# A tibble: 6 x 3
#  Species    InGroup MeanSepalLength
#  <fct>      <chr>             <dbl>
#1 setosa     Yes                5.01
#2 setosa     No                 6.26
#3 versicolor Yes                5.94
#4 versicolor No                 5.80
#5 virginica  Yes                6.59
#6 virginica  No                 5.47

2) If we want to do this in multiple columns, use across

iris %>% 
    group_by(Species) %>% 
    summarise(InGroup = c('Yes', 'No'), 
       across(where(is.numeric),  ~ c(mean(.),
       mean(iris[[cur_column()]][
         as.numeric(iris$Species) != cur_group_id()])), .names = 'Mean{.col}'),
          .groups = 'drop')

-output

# A tibble: 6 x 6
#  Species    InGroup MeanSepal.Length MeanSepal.Width MeanPetal.Length MeanPetal.Width
#  <fct>      <chr>              <dbl>           <dbl>            <dbl>           <dbl>
#1 setosa     Yes                 5.01            3.43             1.46           0.246
#2 setosa     No                  6.26            2.87             4.91           1.68 
#3 versicolor Yes                 5.94            2.77             4.26           1.33 
#4 versicolor No                  5.80            3.20             3.51           1.14 
#5 virginica  Yes                 6.59            2.97             5.55           2.03 
#6 virginica  No                  5.47            3.10             2.86           0.786

3) If we need a function, that can be created as well

f1 <- function(dat, grp) {
    grp_str <- rlang::as_string(rlang::ensym(grp))
    dat %>%
       group_by({{grp}}) %>%
       summarise(InGroup = c('Yes', 'No'),
            across(where(is.numeric), ~ c(mean(.),
                 mean(dat[[cur_column()]][
                      as.numeric(dat[[grp_str]]) != cur_group_id()])),
                   .names = 'Mean{.col}'), .groups = 'drop')
    }

-testing

f1(iris, Species)
# A tibble: 6 x 6
#  Species    InGroup MeanSepal.Length MeanSepal.Width MeanPetal.Length MeanPetal.Width
#  <fct>      <chr>              <dbl>           <dbl>            <dbl>           <dbl>
#1 setosa     Yes                 5.01            3.43             1.46           0.246
#2 setosa     No                  6.26            2.87             4.91           1.68 
#3 versicolor Yes                 5.94            2.77             4.26           1.33 
#4 versicolor No                  5.80            3.20             3.51           1.14 
#5 virginica  Yes                 6.59            2.97             5.55           2.03 
#6 virginica  No                  5.47            3.10             2.86           0.786

or with diamonds

f1(diamonds, cut)
# A tibble: 10 x 9
#   cut       InGroup Meancarat Meandepth Meantable Meanprice Meanx Meany Meanz
#   <ord>     <chr>       <dbl>     <dbl>     <dbl>     <dbl> <dbl> <dbl> <dbl>
# 1 Fair      Yes         1.05       64.0      59.1     4359.  6.25  6.18  3.98
# 2 Fair      No          0.790      61.7      57.4     3920.  5.72  5.72  3.53
# 3 Good      Yes         0.849      62.4      58.7     3929.  5.84  5.85  3.64
# 4 Good      No          0.793      61.7      57.3     3933.  5.72  5.72  3.53
# 5 Very Good Yes         0.806      61.8      58.0     3982.  5.74  5.77  3.56
# 6 Very Good No          0.796      61.7      57.3     3919.  5.73  5.72  3.53
# 7 Premium   Yes         0.892      61.3      58.7     4584.  5.97  5.94  3.65
# 8 Premium   No          0.766      61.9      57.0     3709.  5.65  5.66  3.50
# 9 Ideal     Yes         0.703      61.7      56.0     3458.  5.51  5.52  3.40
#10 Ideal     No          0.861      61.8      58.5     4249.  5.88  5.88  3.63

4) Or another option would be to take the difference in sum and divide by the difference in number of rows

iris %>% 
    group_by(Species) %>% 
    summarise(InGroup = c('Yes', 'No'), across(where(is.numeric), 
       ~  c(mean(.), (sum(iris[[cur_column()]]) - 
           sum(.))/(nrow(iris) - n())), .names = 'Mean{.col}'), .groups = 'drop')

Upvotes: 0

eipi10
eipi10

Reputation: 93811

In the code below, we use map to operate separately on each level of Species. For each iteration, we create a grouping column in.group marking whether a row is or is not a member of the given species. We then return means by group for all numeric columns:

library(tidyverse)

unique(as.character(iris$Species)) %>% 
  set_names() %>% 
  map_df(
    ~iris %>% 
      group_by(in.group = Species==.x) %>% 
      summarise(across(where(is.numeric), mean, .names="mean_{col}")),
    .id="Species"
  )
#> # A tibble: 6 x 6
#>   Species    in.group mean_Sepal.Length mean_Sepal.Width mean_Petal.Length
#>   <chr>      <lgl>                <dbl>            <dbl>             <dbl>
#> 1 setosa     FALSE                 6.26             2.87              4.91
#> 2 setosa     TRUE                  5.01             3.43              1.46
#> 3 versicolor FALSE                 5.80             3.20              3.51
#> 4 versicolor TRUE                  5.94             2.77              4.26
#> 5 virginica  FALSE                 5.47             3.10              2.86
#> 6 virginica  TRUE                  6.59             2.97              5.55
#> # … with 1 more variable: mean_Petal.Width <dbl>

You can also add the following onto the chain to make the output a bit more economical:

  mutate(Species = case_when(in.group ~ Species,
                             !in.group ~ paste("not", Species))) %>% 
  select(-in.group)

Which gives:

  Species        mean_Sepal.Length mean_Sepal.Width mean_Petal.Length mean_Petal.Width
1 not setosa                  6.26             2.87              4.91            1.68 
2 setosa                      5.01             3.43              1.46            0.246
3 not versicolor              5.80             3.20              3.51            1.14 
4 versicolor                  5.94             2.77              4.26            1.33 
5 not virginica               5.47             3.10              2.86            0.786
6 virginica                   6.59             2.97              5.55            2.03 

You can package this as a function:

compare.groups = function(data, group) {
  
  group = ensym(group)
  
  # Get levels of group
  x = data %>% 
    distinct(!!group) %>% 
    pull(!!group) %>% 
    as.character %>% 
    set_names() 
  
  # Map over each level
  x %>% 
    map_df(
      ~ data %>% 
          group_by(in.group = !!group == .x) %>% 
          summarise(across(where(is.numeric), mean, .names="mean_{col}")),
      .id=as_label(enquo(group))
    ) %>%
    mutate(!!group := case_when(in.group ~ !!group,
                                !in.group ~ paste("not", !!group))) %>% 
    select(-in.group)
}

# Run the function on a couple of data frames
compare.groups(iris, Species)
compare.groups(diamonds, cut)

You can also use the function to get results for all categorical columns in your data frame:

diamonds %>% 
  select(where(~!is.numeric(.))) %>% 
  names() %>% 
  set_names() %>% 
  map_df(
    ~compare.groups(diamonds, !!.x) %>% 
      rename(category = .x),
    .id="variable"
  ) 
   variable category      mean_carat mean_depth mean_table mean_price mean_x mean_y mean_z
 1 cut      not Ideal          0.861       61.8       58.5      4249.   5.88   5.88   3.63
 2 cut      Ideal              0.703       61.7       56.0      3458.   5.51   5.52   3.40
 3 cut      not Premium        0.766       61.9       57.0      3709.   5.65   5.66   3.50
 4 cut      Premium            0.892       61.3       58.7      4584.   5.97   5.94   3.65
 5 cut      not Good           0.793       61.7       57.3      3933.   5.72   5.72   3.53
 6 cut      Good               0.849       62.4       58.7      3929.   5.84   5.85   3.64
 7 cut      not Very Good      0.796       61.7       57.3      3919.   5.73   5.72   3.53
 8 cut      Very Good          0.806       61.8       58.0      3982.   5.74   5.77   3.56
 9 cut      not Fair           0.790       61.7       57.4      3920.   5.72   5.72   3.53
10 cut      Fair               1.05        64.0       59.1      4359.   6.25   6.18   3.98
11 color    not E              0.829       61.8       57.4      4123.   5.80   5.80   3.58
12 color    E                  0.658       61.7       57.5      3077.   5.41   5.42   3.34
13 color    not I              0.772       61.7       57.4      3803.   5.68   5.68   3.50
14 color    I                  1.03        61.8       57.6      5092.   6.22   6.22   3.85
15 color    not J              0.778       61.7       57.4      3856.   5.69   5.69   3.51
16 color    J                  1.16        61.9       57.8      5324.   6.52   6.52   4.03
17 color    not H              0.777       61.7       57.4      3832.   5.69   5.69   3.51
18 color    H                  0.912       61.8       57.5      4487.   5.98   5.98   3.70
19 color    not F              0.811       61.8       57.5      3977.   5.76   5.76   3.55
20 color    F                  0.737       61.7       57.4      3725.   5.61   5.62   3.46
# … with 20 more rows

Upvotes: 1

Related Questions