rf7
rf7

Reputation: 2191

Arrange within a group with dplyr

I am using the library(nycflights13) and I use the following command to group_by month and day, select the top 3 rows within each group and then sort in descending order within each group by departure delay. The code is the following:

flights %>% group_by(month, day)  %>% top_n(3, dep_delay) %>% arrange(desc(dep_delay))

Which returns the following output:

    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin  dest
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>   <chr>  <int>   <chr>  <chr> <chr>
1   2013     1     9      641            900      1301     1242           1530      1272      HA     51  N384HA    JFK   HNL
2   2013     6    15     1432           1935      1137     1607           2120      1127      MQ   3535  N504MQ    JFK   CMH
3   2013     1    10     1121           1635      1126     1239           1810      1109      MQ   3695  N517MQ    EWR   ORD
4   2013     9    20     1139           1845      1014     1457           2210      1007      AA    177  N338AA    JFK   SFO
5   2013     7    22      845           1600      1005     1044           1815       989      MQ   3075  N665MQ    JFK   CVG
6   2013     4    10     1100           1900       960     1342           2211       931      DL   2391  N959DL    JFK   TPA

The records are sorted in descending order but not within groups.

Why is that? What should be done to correct the code? Your advice will be appreciated.

#

Edit

#

Following the suggestions made in the comments I still don't get what I am looking for, i.e. within each month-day grouping sorting of the top 3 records in descending order in terms of the departure delay:

flights %>% group_by(month, day)  %>% top_n(3, dep_delay) %>% arrange(desc(month, day,  dep_delay))

   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin  dest
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>   <chr>  <int>   <chr>  <chr> <chr>
1   2013    12     1      657           1930       687     1010           2249       681      DL   1091  N342NW    JFK   SAT
2   2013    12     1     1504           1056       248     1628           1230       238      EV   5309  N615QX    LGA   BGR
3   2013    12     1     2017           1455       322     2146           1609       337      DL   1164  N6704Z    JFK   BOS
4   2013    12     2     1139            745       234     1358           1012       226      DL    807  N935AT    EWR   ATL
5   2013    12     2     1823           1345       278     2123           1640       283      UA   1510  N75861    EWR   IAH
6   2013    12     2     1842           1428       254     1955           1545       250      EV   5712  N827AS    JFK   IAD
7   2013    12     3      920            600       200     1158            846       192      B6    583  N535JB    JFK   MCO
8   2013    12     3     1424           1114       190     1713           1347       206      UA    405  N437UA    LGA   DEN
9   2013    12     3     2300           1935       205      116           2203       193      FL   1346  N964AT    LGA   ATL
10  2013    12     4     1210            829       221     1440           1055       225      EV   4419  N23139    EWR   XNA

Upvotes: 21

Views: 39147

Answers (2)

Jeff Bezos
Jeff Bezos

Reputation: 2253

Just add the .by_group = TRUE argument to arrange

flights %>%
   group_by(month, day) %>%
   top_n(3, dep_delay) %>%
   arrange(dep_delay, .by_group = TRUE)

Upvotes: 33

Sraffa
Sraffa

Reputation: 1668

I think the problem in your second example is that your are using desc on all the variables at the same time, so it is only applied to the month column.

   flights %>% group_by(month, day)  %>% top_n(3, dep_delay) %>% 
        arrange(
            month, 
            day,  
            desc(dep_delay)
        )

Source: local data frame [1,108 x 19]
Groups: month, day [365]

    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>   <chr>  <int>   <chr>  <chr>
1   2013     1     1      848           1835       853     1001           1950       851      MQ   3944  N942MQ    JFK
2   2013     1     1     2343           1724       379      314           1938       456      EV   4321  N21197    EWR
3   2013     1     1     1815           1325       290     2120           1542       338      EV   4417  N17185    EWR
4   2013     1     2     2131           1512       379     2340           1741       359      UA    488  N593UA    LGA
5   2013     1     2     1607           1030       337     2003           1355       368      AA    179  N324AA    JFK
6   2013     1     2     1412            838       334     1710           1147       323      UA    468  N474UA    EWR
7   2013     1     3     2056           1605       291     2239           1754       285      9E   3459  N928XJ    JFK
8   2013     1     3     2008           1540       268     2339           1909       270      DL   2027  N338NW    JFK
9   2013     1     3     2012           1600       252     2314           1857       257      B6    369  N558JB    LGA
10  2013     1     4     2123           1635       288     2332           1856       276      EV   3805  N29917    EWR
# ... with 1,098 more rows, and 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>

Upvotes: 15

Related Questions