Reputation: 1140
I am getting an unexpected result when using dplyr to create a total relative frequency table while grouping by two variables. Here is an example:
library(dplyr)

set.seed(1234)
dat1 <- data.frame(
  color = c(rep("red", 4), rep("green", 4)),
  type = rep(c("big", "small"), 4),
  value = sample(1:6, 8, replace = TRUE)
)

# Count rows per color/type combination, then compute relative frequencies
dat1 %>%
  group_by(color, type) %>%
  summarise(n = n()) %>%
  mutate(total = sum(n), rel.freq = n / total)
Here is the result of the preceding code:
# A tibble: 4 x 5
# Groups:   color [2]
  color type      n total rel.freq
  <fct> <fct> <int> <int>    <dbl>
1 green big       2     4    0.500
2 green small     2     4    0.500
3 red   big       2     4    0.500
4 red   small     2     4    0.500
However, I would expect this:
# A tibble: 4 x 5
# Groups:   color [2]
  color type      n total rel.freq
  <fct> <fct> <int> <int>    <dbl>
1 green big       2     8    0.250
2 green small     2     8    0.250
3 red   big       2     8    0.250
4 red   small     2     8    0.250
Any insight into why the mutate() in the dplyr pipe above is grouping only by the first grouping variable (or why it is grouping at all; my notion is that it should be working on the summarise() data set) would be greatly appreciated.
The total column should indicate that there are 8 cases in total (i.e., sum(n) from the previous result in summarise() should equal 8).
Upvotes: 2
Views: 3913
Reputation: 887118
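After each summarise(), one grouping variable is dropped, namely the last one in the group_by() order. Here type is dropped, so the summarised result is still grouped by color, and mutate() computes sum(n) within each color group (4) rather than over all rows (8). We need to ungroup() after the summarise():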
dat1 %>%
  group_by(color, type) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  mutate(total = sum(n), rel.freq = n / total)
# A tibble: 4 x 5
#  color type      n total rel.freq
#  <fct> <fct> <int> <int>    <dbl>
#1 green big       2     8     0.25
#2 green small     2     8     0.25
#3 red   big       2     8     0.25
#4 red   small     2     8     0.25
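To see the leftover grouping directly, and as a possible alternative to a separate ungroup() step, here is a minimal sketch. It assumes dplyr 1.0.0 or later, where summarise() accepts a .groups argument; the names by_color and res are only for illustration and are not part of the original post.

```r
library(dplyr)

set.seed(1234)
dat1 <- data.frame(
  color = c(rep("red", 4), rep("green", 4)),
  type = rep(c("big", "small"), 4),
  value = sample(1:6, 8, replace = TRUE)
)

# Default behaviour: summarise() drops only the last grouping variable (type),
# so the result is still grouped by color.
by_color <- dat1 %>%
  group_by(color, type) %>%
  summarise(n = n())
group_vars(by_color)  # "color"

# Assumption: dplyr >= 1.0.0. .groups = "drop" removes all grouping,
# so the following mutate() sums n over every row.
res <- dat1 %>%
  group_by(color, type) %>%
  summarise(n = n(), .groups = "drop") %>%
  mutate(total = sum(n), rel.freq = n / total)
group_vars(res)       # character(0)
```

With the grouping dropped, total is 8 in every row. Calling count(color, type) on the ungrouped data frame should give the same ungrouped counts without group_by() and summarise().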
Upvotes: 5