hmje

Reputation: 117

Get proportions of a dataframe grouped by multiple variables in R

I have a data frame like this:

df <- tribble(
  ~x, ~y, ~z,
  FALSE,"N",1,
  FALSE,"N",2,
  FALSE,"W",1,
  FALSE,"E",3,
  FALSE,"E",1,
  TRUE,"N",2,
  TRUE,"W",2,
  TRUE,"E",1
)

I want to group it by the first two variables and then attach a column with each group's proportion of all rows, so I tried:

df %>%
  group_by(x,y) %>%
  summarize(count = n()) %>%
  mutate(prop = count/sum(count))

But I get

tribble(
  ~x, ~y, ~count, ~prop,
  FALSE, "E", 2, 0.4,
  FALSE, "N", 2, 0.4,
  FALSE, "W", 1, 0.2,
  TRUE, "E", 1, 0.33,
  TRUE, "N", 1, 0.33,
  TRUE, "W", 1, 0.33
)

instead of

tribble(
  ~x, ~y, ~count, ~prop,
  FALSE, "E", 2, 0.25,
  FALSE, "N", 2, 0.25,
  FALSE, "W", 1, 0.125,
  TRUE, "E", 1, 0.125,
  TRUE, "N", 1, 0.125,
  TRUE, "W", 1, 0.125
)

Why does this happen?

Upvotes: 0

Views: 2258

Answers (2)

Ronak Shah

Reputation: 388982

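Another way, without an explicit group_by(), is to count() the combinations of x and y and then calculate the proportions. Note that this overwrites the count column n with the proportion.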

library(dplyr)

df %>% count(x, y) %>% mutate(n = n/sum(n))

#   x     y      n
#  <lgl> <chr> <dbl>
#1 FALSE E     0.25 
#2 FALSE N     0.25 
#3 FALSE W     0.125
#4 TRUE  E     0.125
#5 TRUE  N     0.125
#6 TRUE  W     0.125

Upvotes: 2

Andrew Brown

Reputation: 1065

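group_by(x, y) gives you a data frame grouped by x and y. summarize() then drops only the last grouping level, so its result is still grouped by x. The sum(count) inside mutate() is therefore computed within each value of x (5 rows for FALSE, 3 for TRUE) rather than over all 8 rows, which is where 0.4 and 0.33 come from. You need an ungroup() before the mutate() to produce the result you want.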

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df <- tribble(
  ~x, ~y, ~z,
  FALSE,"N",1,
  FALSE,"N",2,
  FALSE,"W",1,
  FALSE,"E",3,
  FALSE,"E",1,
  TRUE,"N",2,
  TRUE,"W",2,
  TRUE,"E",1
)

df %>%
  group_by(x,y) %>%
  summarize(count = n()) %>%
  ungroup() %>%
  mutate(prop = count/sum(count))
#> `summarise()` regrouping output by 'x' (override with `.groups` argument)
#> # A tibble: 6 x 4
#>   x     y     count  prop
#>   <lgl> <chr> <int> <dbl>
#> 1 FALSE E         2 0.25 
#> 2 FALSE N         2 0.25 
#> 3 FALSE W         1 0.125
#> 4 TRUE  E         1 0.125
#> 5 TRUE  N         1 0.125
#> 6 TRUE  W         1 0.125

Created on 2020-11-23 by the reprex package (v0.3.0)
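
To check the grouping described at the top of this answer, the grouping that summarize() leaves behind can be inspected with dplyr's group_vars(). This is only a minimal sketch reusing the df from the question; the name summarized is illustrative and not part of the original code.

```r
library(dplyr)

df <- tribble(
  ~x, ~y, ~z,
  FALSE, "N", 1,
  FALSE, "N", 2,
  FALSE, "W", 1,
  FALSE, "E", 3,
  FALSE, "E", 1,
  TRUE, "N", 2,
  TRUE, "W", 2,
  TRUE, "E", 1
)

# summarize() peels off only the last grouping variable (y)
summarized <- df %>%
  group_by(x, y) %>%
  summarize(count = n())

# Grouping variables still attached to the summarized result
group_vars(summarized)

# Grouping variables left after an explicit ungroup()
group_vars(ungroup(summarized))
```

The first call should report x as the remaining grouping variable, and the second should report none, which is why the later mutate() behaves differently with and without ungroup().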

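See also the .groups argument of summarize() for other ways to handle multiple grouping levels. When .groups is not given, the default depends on the number of rows each group returns: "drop_last" if every group returns one row, "keep" if the number of rows varies.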
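
As a sketch of the .groups argument mentioned above, and assuming dplyr 1.0.0 or later (where the argument is available), .groups = "drop" makes summarize() return an ungrouped result, so the separate ungroup() step is not needed. The name result is illustrative only.

```r
library(dplyr)

df <- tribble(
  ~x, ~y, ~z,
  FALSE, "N", 1,
  FALSE, "N", 2,
  FALSE, "W", 1,
  FALSE, "E", 3,
  FALSE, "E", 1,
  TRUE, "N", 2,
  TRUE, "W", 2,
  TRUE, "E", 1
)

# .groups = "drop" removes all grouping from the summarized result,
# so sum(count) below runs over every row instead of within each x
result <- df %>%
  group_by(x, y) %>%
  summarize(count = n(), .groups = "drop") %>%
  mutate(prop = count / sum(count))

result
```

Setting .groups explicitly should also silence the regrouping message that summarize() prints in the reprex above.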

Upvotes: 3
