Wassadamo
Wassadamo

Reputation: 1386

How to subset and order rows in one data.table call?

I'm learning data.table and am trying to subset and re-order grouped data while calculating averages, all in one data.table statement.

I got the flight data from here while following along in this tutorial.

From the tutorial,

How can we get the average arrival and departure delay for each orig,dest pair for each month for carrier code “AA”?

ans <- flights[carrier == "AA",
        .(mean(arr_delay), mean(dep_delay)),
        by = .(origin, dest, month)]
ans
#      origin dest month         V1         V2
#   1:    JFK  LAX     1   6.590361 14.2289157
#   2:    LGA  PBI     1  -7.758621  0.3103448
#   3:    EWR  LAX     1   1.366667  7.5000000
#   4:    JFK  MIA     1  15.720670 18.7430168
#   5:    JFK  SEA     1  14.357143 30.7500000
#  ---                                        
# 196:    LGA  MIA    10  -6.251799 -1.4208633
# 197:    JFK  MIA    10  -1.880184  6.6774194
# 198:    EWR  PHX    10  -3.032258 -4.2903226
# 199:    JFK  MCO    10 -10.048387 -1.6129032
# 200:    JFK  DCA    10  16.483871 15.5161290

This makes sense to me. However, I want to improve on this by organizing the flights which departed from the same origin together. Following the schema of...

data.table(subset, select/compute, group by)

I came up with this:

flights[(carrier=="AA")[order(origin, dest)], 
       .(Arrival_Delay=mean(arr_delay), Depart_Delay=mean(dep_delay)), 
       by=.(origin, dest, month)]

But my call to order doesn't seem to have done anything. Why?

I can achieve this in dplyr with:

 flights %>%
  filter(carrier=="AA") %>%
  group_by(origin, dest, month) %>%
  summarize(Arrival_Delay=mean(arr_delay), Depart_Delay=mean(dep_delay)) %>%
  as.data.table()

I'm curious how to do this in data.table and why my approach did not work.

Upvotes: 4

Views: 1288

Answers (1)

cderv
cderv

Reputation: 6542

With data.table, you can use keyby when you want to sort a result after a by operation. From help("data.table", package = "data.table")

keyby
Same as by, but with an additional setkey() run on the by columns of the result, for convenience. It is common practice to use 'keyby=' routinely when you wish the result to be sorted.

You could then use keyby instead of by in your code


library(data.table)
# get the data from the web
flights <- fread("https://github.com/arunsrinivasan/flights/wiki/NYCflights14/flights14.csv")
# keyby will call setkey and then sort your result
ans <- flights[carrier == "AA",
               .(mean(arr_delay), mean(dep_delay)),
               keyby = .(origin, dest, month)]
ans
#>      origin dest month         V1         V2
#>   1:    EWR  DFW     1   6.427673 10.0125786
#>   2:    EWR  DFW     2  10.536765 11.3455882
#>   3:    EWR  DFW     3  12.865031  8.0797546
#>   4:    EWR  DFW     4  17.792683 12.9207317
#>   5:    EWR  DFW     5  18.487805 18.6829268
#>  ---                                        
#> 196:    LGA  PBI     1  -7.758621  0.3103448
#> 197:    LGA  PBI     2  -7.865385  2.4038462
#> 198:    LGA  PBI     3  -5.754098  3.0327869
#> 199:    LGA  PBI     4 -13.966667 -4.7333333
#> 200:    LGA  PBI     5 -10.357143 -6.8571429

Upvotes: 4

Related Questions