zimmeee

Reputation: 333

dplyr: Arrange not behaving as expected after group_by and summarize

I must be missing something with how group_by levels in dplyr get peeled off. In the example below, I group by 2 columns, summarize values into a single variable, then sort by that new variable:

mtcars %>% group_by( cyl, gear ) %>% 
  summarize( hp_range = max(hp) - min(mpg)) %>% 
  arrange( desc(hp_range) )

# Source: local data frame [8 x 3]
# Groups: cyl [3]
#
#    cyl  gear  hp_range
#  (dbl) (dbl) (dbl)
#1     4     4  87.6
#2     4     5  87.0
#3     4     3  75.5
#4     6     5 155.3
#5     6     4 105.2
#6     6     3  91.9
#7     8     5 320.0
#8     8     3 234.6

Obviously this is not sorted by hp_range as intended. What am I missing?

EDIT: The example works as expected without the call to desc in arrange. Still unclear why?

Upvotes: 5

Views: 2580

Answers (1)

zimmeee

Reputation: 333

Ok, just got to the bottom of this:

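  1. The call to desc was not the cause. It was only by chance that the example appeared to work without it: sorted in ascending order within each cyl group, the rows also happen to be in ascending order overall.
  2. The key is that summarize only peels off the last grouping level (gear), so the result is still grouped by cyl (note the "Groups: cyl [3]" line in the output above). arrange then appears to sort within each group rather than across the whole table. To get the intended sort of the entire data table, you must first ungroup and then arrange: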

    mtcars %>% group_by( cyl, gear ) %>% 
       summarize( hp_range = max(hp) - min(mpg)) %>% 
       ungroup() %>% 
       arrange( hp_range )
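
For anyone who wants to check this on their own install, here is a minimal sketch that makes the leftover grouping visible and applies the descending sort from the question. It assumes dplyr 1.0.0 or later, because it uses the `.groups` argument of `summarize()`; the names `by_cyl_gear` and `sorted` are just illustrative. As far as I know, more recent dplyr releases have `arrange()` ignore grouping unless `.by_group = TRUE` is passed, so the original symptom may not reproduce there, but `ungroup()` is harmless either way.

```r
library(dplyr)

# summarize() drops only the last grouping variable (gear), so the
# result is still grouped by cyl. .groups = "drop_last" spells out that
# default explicitly (assumes dplyr >= 1.0.0).
by_cyl_gear <- mtcars %>%
  group_by(cyl, gear) %>%
  summarize(hp_range = max(hp) - min(mpg), .groups = "drop_last")

group_vars(by_cyl_gear)   # remaining grouping variables: "cyl"

# Remove the remaining grouping, then sort across the whole table.
sorted <- by_cyl_gear %>%
  ungroup() %>%
  arrange(desc(hp_range))

group_vars(sorted)        # no grouping variables left
sorted
```

Passing `.groups = "drop"` to `summarize()` should give the same ungrouped result without a separate `ungroup()` step.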
    

Upvotes: 8
