hmhensen
hmhensen

Reputation: 3195

Duplicate rows after summarise with case_when

I'm coming across an oddity using summarise with case_when. I'm using case_when because I need to perform a different calculation for a subset of my groups. But when I do, the groups don't collapse as expected. When I ran it without case_when, it collapsed. Here's an example.

Here's what I tried.

df %>% 
  group_by(grp) %>% 
  summarize(value = case_when(grp == "g4" ~ sum(x), 
                              T ~ weighted.mean(x, w)), 
            .groups = "keep") %>% 
  ungroup()
## A tibble: 12 x 2
#grp        value
#<chr>      <dbl>
#1 g1        0.466 
#2 g1        0.466 
#3 g1        0.466 
#4 g2        0.439 
#5 g2        0.439 
#6 g2        0.439 
#7 g3        0.0885
#8 g3        0.0885
#9 g3        0.0885
#10 g4    15839     
#11 g4    15839     
#12 g4    15839 

As you can see, the groups are all duplicated rather than collapsed. If I take out the case_when (or ifelse) and use a single calculation on the column, then the data frame comes out as expected, although the value for "g4" is wrong since I didn't use the correct function.

df %>% 
  group_by(grp) %>% 
  summarize(value = weighted.mean(x, w), 
            .groups = "keep") %>% 
  ungroup()
## A tibble: 4 x 2
#grp       value
#<chr>     <dbl>
#1 g1       0.466 
#2 g2       0.439 
#3 g3       0.0885
#4 g4    6726.  

What's going on here and how do I fix it? I know I can use distinct but I'd like to do this correctly and understand why I'm not getting the expected output.

Data:

df <- structure(list(var = c("A", "A", "A", "A", "B", "B", "B", "B", 
                             "C", "C", "C", "C"), grp = c("g1", "g2", "g3", "g4", "g1", "g2", 
                                                          "g3", "g4", "g1", "g2", "g3", "g4"), x = c(0.374, 0.348, 0.067, 
                                                                                                     1531, 0.484, 0.461, 0.088, 8943, 0.47, 0.437, 0.095, 5365), w = c(4300, 
                                                                                                                                                                       4300, 4300, 4300, 19572, 19572, 19572, 19572, 15461, 15461, 15461, 
                                                                                                                                                                       15461)), row.names = c(NA, -12L), class = c("tbl_df", "tbl", 
                                                                                                                                                                                                                   "data.frame"))

Upvotes: 0

Views: 858

Answers (1)

Ronak Shah
Ronak Shah

Reputation: 389135

That is because grp == "g4" comparison returns more than one value in each group. For example, for group 'g1' it returns 3 values.

df$grp[df$grp == 'g1'] == "g4"
#[1] FALSE FALSE FALSE

Since the input condition is length 3, output would have 3 values as well. To have only one value in each group the condition should also return only one value. You can wrap it in all.

library(dplyr)

df %>% 
  group_by(grp) %>% 
  summarize(value = case_when(all(grp == "g4") ~ sum(x), 
                              T ~ weighted.mean(x, w)), 
            .groups = "drop") 

# A tibble: 4 x 2
#  grp        value
#  <chr>      <dbl>
#1 g1        0.466 
#2 g2        0.439 
#3 g3        0.0885
#4 g4    15839     

You can also use if/else since there are only 2 outcomes here.

df %>% 
  group_by(grp) %>% 
  summarize(value = if(all(grp == "g4")) sum(x) else weighted.mean(x, w), .groups = "drop")

Upvotes: 2

Related Questions