Reputation: 7337
Suppose I want to calculate the proportion of different values within each group. For example, using the mtcars
data, how do I calculate the relative frequency of number of gears by am (automatic/manual) in one go with dplyr
?
library(dplyr)
data(mtcars)
mtcars <- tbl_df(mtcars)
# count frequency
mtcars %>%
group_by(am, gear) %>%
summarise(n = n())
# am gear n
# 0 3 15
# 0 4 4
# 1 4 8
# 1 5 5
What I would like to achieve:
am gear n rel.freq
0 3 15 0.7894737
0 4 4 0.2105263
1 4 8 0.6153846
1 5 5 0.3846154
Upvotes: 251
Views: 301959
Reputation: 1488
Here's simple and idiomatic solution that respects prior grouping (computing proportions within each group and doesn't make you write any variable names twice (a common source of bugs!)
props = function(data, ...) {
data %>%
count(...) %>%
mutate(prop = n / sum(n), .keep="unused", by=...)
}
mtcars %>% group_by(cyl,vs) %>% prop(gear, carb)
prop
sums to one within each original group.
If you're not familiar with the ... syntax. It's incredibly useful for writing reusable functions.
Upvotes: 1
Reputation: 18581
Despite the many answers, one more approach which uses prop.table
in combination with 'dplyr' or 'data.table'.
Since 'dplyr' v. >= 1.1.0 we can use the .by
argument in mutate
:
library(dplyr)
mtcars %>%
count(am, gear) %>%
mutate(freq = prop.table(n), .by = am)
#> am gear n freq
#> 1 0 3 15 0.7894737
#> 2 0 4 4 0.2105263
#> 3 1 4 8 0.6153846
#> 4 1 5 5 0.3846154
Before 'dplyr' v. < 1.1.0 one approach would be:
mtcars %>%
group_by(am, gear) %>%
tally() %>%
mutate(freq = prop.table(n))
#> # A tibble: 4 × 4
#> # Groups: am [2]
#> am gear n freq
#> <dbl> <dbl> <int> <dbl>
#> 1 0 3 15 0.789
#> 2 0 4 4 0.211
#> 3 1 4 8 0.615
#> 4 1 5 5 0.385
With 'data.table' we can do:
library(data.table)
cars_dt <- as.data.table(mtcars)
cars_dt[, .(n = .N), keyby = .(am, gear)][, freq := prop.table(n), by = "am"][]
#> am gear n freq
#> 1: 0 3 15 0.7894737
#> 2: 0 4 4 0.2105263
#> 3: 1 4 8 0.6153846
#> 4: 1 5 5 0.3846154
Created on 2022-10-22 with reprex v2.0.2
Upvotes: 24
Reputation: 51
Also, try add_count()
(to get around pesky group_by .groups).
mtcars %>%
count(am, gear) %>%
add_count(am, wt = n, name = "nn") %>%
mutate(proportion = n / nn)
Upvotes: 5
Reputation: 1398
For the sake of completeness of this popular question, since version 1.0.0 of dplyr
, parameter .groups controls the grouping structure of the summarise
function after group_by
summarise help.
With .groups = "drop_last"
, summarise
drops the last level of grouping. This was the only result obtained before version 1.0.0.
library(dplyr)
library(scales)
original <- mtcars %>%
group_by (am, gear) %>%
summarise (n=n()) %>%
mutate(rel.freq = scales::percent(n/sum(n), accuracy = 0.1))
#> `summarise()` regrouping output by 'am' (override with `.groups` argument)
original
#> # A tibble: 4 x 4
#> # Groups: am [2]
#> am gear n rel.freq
#> <dbl> <dbl> <int> <chr>
#> 1 0 3 15 78.9%
#> 2 0 4 4 21.1%
#> 3 1 4 8 61.5%
#> 4 1 5 5 38.5%
new_drop_last <- mtcars %>%
group_by (am, gear) %>%
summarise (n=n(), .groups = "drop_last") %>%
mutate(rel.freq = scales::percent(n/sum(n), accuracy = 0.1))
dplyr::all_equal(original, new_drop_last)
#> [1] TRUE
With .groups = "drop"
, all levels of grouping are dropped. The result is turned into an independent tibble with no trace of the previous group_by
# .groups = "drop"
new_drop <- mtcars %>%
group_by (am, gear) %>%
summarise (n=n(), .groups = "drop") %>%
mutate(rel.freq = scales::percent(n/sum(n), accuracy = 0.1))
new_drop
#> # A tibble: 4 x 4
#> am gear n rel.freq
#> <dbl> <dbl> <int> <chr>
#> 1 0 3 15 46.9%
#> 2 0 4 4 12.5%
#> 3 1 4 8 25.0%
#> 4 1 5 5 15.6%
If .groups = "keep"
, same grouping structure as .data (mtcars, in this case). summarise
does not peel off any variable used in the group_by
.
Finally, with .groups = "rowwise"
, each row is it's own group. It is equivalent to "keep" in this situation
# .groups = "keep"
new_keep <- mtcars %>%
group_by (am, gear) %>%
summarise (n=n(), .groups = "keep") %>%
mutate(rel.freq = scales::percent(n/sum(n), accuracy = 0.1))
new_keep
#> # A tibble: 4 x 4
#> # Groups: am, gear [4]
#> am gear n rel.freq
#> <dbl> <dbl> <int> <chr>
#> 1 0 3 15 100.0%
#> 2 0 4 4 100.0%
#> 3 1 4 8 100.0%
#> 4 1 5 5 100.0%
# .groups = "rowwise"
new_rowwise <- mtcars %>%
group_by (am, gear) %>%
summarise (n=n(), .groups = "rowwise") %>%
mutate(rel.freq = scales::percent(n/sum(n), accuracy = 0.1))
dplyr::all_equal(new_keep, new_rowwise)
#> [1] TRUE
Another point that can be of interest is that sometimes, after applying group_by
and summarise
, a summary line can help.
# create a subtotal line to help readability
subtotal_am <- mtcars %>%
group_by (am) %>%
summarise (n=n()) %>%
mutate(gear = NA, rel.freq = 1)
#> `summarise()` ungrouping output (override with `.groups` argument)
mtcars %>% group_by (am, gear) %>%
summarise (n=n()) %>%
mutate(rel.freq = n/sum(n)) %>%
bind_rows(subtotal_am) %>%
arrange(am, gear) %>%
mutate(rel.freq = scales::percent(rel.freq, accuracy = 0.1))
#> `summarise()` regrouping output by 'am' (override with `.groups` argument)
#> # A tibble: 6 x 4
#> # Groups: am [2]
#> am gear n rel.freq
#> <dbl> <dbl> <int> <chr>
#> 1 0 3 15 78.9%
#> 2 0 4 4 21.1%
#> 3 0 NA 19 100.0%
#> 4 1 4 8 61.5%
#> 5 1 5 5 38.5%
#> 6 1 NA 13 100.0%
Created on 2020-11-09 by the reprex package (v0.3.0)
Hope you find this answer useful.
Upvotes: 11
Reputation: 389225
Here is a base R answer using aggregate
and ave
:
df1 <- with(mtcars, aggregate(list(n = mpg), list(am = am, gear = gear), length))
df1$prop <- with(df1, n/ave(n, am, FUN = sum))
#Also with prop.table
#df1$prop <- with(df1, ave(n, am, FUN = prop.table))
df1
# am gear n prop
#1 0 3 15 0.7894737
#2 0 4 4 0.2105263
#3 1 4 8 0.6153846
#4 1 5 5 0.3846154
We can also use prop.table
but the output displays differently.
prop.table(table(mtcars$am, mtcars$gear), 1)
# 3 4 5
# 0 0.7894737 0.2105263 0.0000000
# 1 0.0000000 0.6153846 0.3846154
Upvotes: 3
Reputation: 67818
Try this:
mtcars %>%
group_by(am, gear) %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n))
# am gear n freq
# 1 0 3 15 0.7894737
# 2 0 4 4 0.2105263
# 3 1 4 8 0.6153846
# 4 1 5 5 0.3846154
From the dplyr vignette:
When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll-up a dataset.
Thus, after the summarise
, the last grouping variable specified in group_by
, 'gear', is peeled off. In the mutate
step, the data is grouped by the remaining grouping variable(s), here 'am'. You may check grouping in each step with groups
.
The outcome of the peeling is of course dependent of the order of the grouping variables in the group_by
call. You may wish to do a subsequent group_by(am)
, to make your code more explicit.
For rounding and prettification, please refer to the nice answer by @Tyler Rinker.
Upvotes: 449
Reputation: 38740
I wrote a small function for this repeating task:
count_pct <- function(df) {
return(
df %>%
tally %>%
mutate(n_pct = 100*n/sum(n))
)
}
I can then use it like:
mtcars %>%
group_by(cyl) %>%
count_pct
It returns:
# A tibble: 3 x 3
cyl n n_pct
<dbl> <int> <dbl>
1 4 11 34.4
2 6 7 21.9
3 8 14 43.8
Upvotes: 10
Reputation: 8940
You can use count()
function, which has however a different behaviour depending on the version of dplyr
:
dplyr 0.7.1: returns an ungrouped table: you need to group again by am
dplyr < 0.7.1: returns a grouped table, so no need to group again, although you might want to ungroup()
for later manipulations
dplyr 0.7.1
mtcars %>%
count(am, gear) %>%
group_by(am) %>%
mutate(freq = n / sum(n))
dplyr < 0.7.1
mtcars %>%
count(am, gear) %>%
mutate(freq = n / sum(n))
This results into a grouped table, if you want to use it for further analysis, it might be useful to remove the grouped attribute with ungroup()
.
Upvotes: 53
Reputation: 3252
Here is a general function implementing Henrik's solution on dplyr
0.7.1.
freq_table <- function(x,
group_var,
prop_var) {
group_var <- enquo(group_var)
prop_var <- enquo(prop_var)
x %>%
group_by(!!group_var, !!prop_var) %>%
summarise(n = n()) %>%
mutate(freq = n /sum(n)) %>%
ungroup
}
Upvotes: 6
Reputation: 1875
This answer is based upon Matifou's answer.
First I modified it to ensure that I don't get the freq column returned as a scientific notation column by using the scipen option.
Then I multiple the answer by 100 to get a percent rather than decimal to make the freq column easier to read as a percentage.
getOption("scipen")
options("scipen"=10)
mtcars %>%
count(am, gear) %>%
mutate(freq = (n / sum(n)) * 100)
Upvotes: 1
Reputation: 110024
@Henrik's is better for usability as this will make the column character and no longer numeric but matches what you asked for...
mtcars %>%
group_by (am, gear) %>%
summarise (n=n()) %>%
mutate(rel.freq = paste0(round(100 * n/sum(n), 0), "%"))
## am gear n rel.freq
## 1 0 3 15 79%
## 2 0 4 4 21%
## 3 1 4 8 62%
## 4 1 5 5 38%
EDIT Because Spacedman asked for it :-)
as.rel_freq <- function(x, rel_freq_col = "rel.freq", ...) {
class(x) <- c("rel_freq", class(x))
attributes(x)[["rel_freq_col"]] <- rel_freq_col
x
}
print.rel_freq <- function(x, ...) {
freq_col <- attributes(x)[["rel_freq_col"]]
x[[freq_col]] <- paste0(round(100 * x[[freq_col]], 0), "%")
class(x) <- class(x)[!class(x)%in% "rel_freq"]
print(x)
}
mtcars %>%
group_by (am, gear) %>%
summarise (n=n()) %>%
mutate(rel.freq = n/sum(n)) %>%
as.rel_freq()
## Source: local data frame [4 x 4]
## Groups: am
##
## am gear n rel.freq
## 1 0 3 15 79%
## 2 0 4 4 21%
## 3 1 4 8 62%
## 4 1 5 5 38%
Upvotes: 36