Reputation: 3195
I'm coming across an oddity using summarise
with case_when
. I'm using case_when
because I need to perform a different calculation for a subset of my groups. But when I do, the groups don't collapse as expected. When I ran it without case_when
, it collapsed. Here's an example.
Here's what I tried.
df %>%
group_by(grp) %>%
summarize(value = case_when(grp == "g4" ~ sum(x),
T ~ weighted.mean(x, w)),
.groups = "keep") %>%
ungroup()
## A tibble: 12 x 2
#grp value
#<chr> <dbl>
#1 g1 0.466
#2 g1 0.466
#3 g1 0.466
#4 g2 0.439
#5 g2 0.439
#6 g2 0.439
#7 g3 0.0885
#8 g3 0.0885
#9 g3 0.0885
#10 g4 15839
#11 g4 15839
#12 g4 15839
As you can see, the groups are all duplicated rather than collapsed. If I take out the case_when
(or ifelse
) and use a single calculation on the column, then the data frame comes out as expected, although the value for "g4" is wrong since I didn't use the correct function.
df %>%
group_by(grp) %>%
summarize(value = weighted.mean(x, w),
.groups = "keep") %>%
ungroup()
## A tibble: 4 x 2
#grp value
#<chr> <dbl>
#1 g1 0.466
#2 g2 0.439
#3 g3 0.0885
#4 g4 6726.
What's going on here and how do I fix it? I know I can use distinct
but I'd like to do this correctly and understand why I'm not getting the expected output.
Data:
df <- structure(list(var = c("A", "A", "A", "A", "B", "B", "B", "B",
"C", "C", "C", "C"), grp = c("g1", "g2", "g3", "g4", "g1", "g2",
"g3", "g4", "g1", "g2", "g3", "g4"), x = c(0.374, 0.348, 0.067,
1531, 0.484, 0.461, 0.088, 8943, 0.47, 0.437, 0.095, 5365), w = c(4300,
4300, 4300, 4300, 19572, 19572, 19572, 19572, 15461, 15461, 15461,
15461)), row.names = c(NA, -12L), class = c("tbl_df", "tbl",
"data.frame"))
Upvotes: 0
Views: 858
Reputation: 389135
That is because grp == "g4"
comparison returns more than one value in each group. For example, for group 'g1'
it returns 3 values.
df$grp[df$grp == 'g1'] == "g4"
#[1] FALSE FALSE FALSE
Since the input condition is length 3, output would have 3 values as well. To have only one value in each group the condition should also return only one value. You can wrap it in all
.
library(dplyr)
df %>%
group_by(grp) %>%
summarize(value = case_when(all(grp == "g4") ~ sum(x),
T ~ weighted.mean(x, w)),
.groups = "drop")
# A tibble: 4 x 2
# grp value
# <chr> <dbl>
#1 g1 0.466
#2 g2 0.439
#3 g3 0.0885
#4 g4 15839
You can also use if
/else
since there are only 2 outcomes here.
df %>%
group_by(grp) %>%
summarize(value = if(all(grp == "g4")) sum(x) else weighted.mean(x, w), .groups = "drop")
Upvotes: 2