Reputation: 1386
I'm learning data.table and am trying to subset and re-order grouped data while calculating averages, all in one data.table statement.
I got the flight data from here while following along in this tutorial.
From the tutorial,
How can we get the average arrival and departure delay for each orig,dest pair for each month for carrier code “AA”?
ans <- flights[carrier == "AA",
.(mean(arr_delay), mean(dep_delay)),
by = .(origin, dest, month)]
ans
# origin dest month V1 V2
# 1: JFK LAX 1 6.590361 14.2289157
# 2: LGA PBI 1 -7.758621 0.3103448
# 3: EWR LAX 1 1.366667 7.5000000
# 4: JFK MIA 1 15.720670 18.7430168
# 5: JFK SEA 1 14.357143 30.7500000
# ---
# 196: LGA MIA 10 -6.251799 -1.4208633
# 197: JFK MIA 10 -1.880184 6.6774194
# 198: EWR PHX 10 -3.032258 -4.2903226
# 199: JFK MCO 10 -10.048387 -1.6129032
# 200: JFK DCA 10 16.483871 15.5161290
This makes sense to me. However, I want to improve on this by organizing the flights which departed from the same origin together. Following the schema of...
data.table(subset, select/compute, group by)
I came up with this:
flights[(carrier=="AA")[order(origin, dest)],
.(Arrival_Delay=mean(arr_delay), Depart_Delay=mean(dep_delay)),
by=.(origin, dest, month)]
But my call to order doesn't seem to have done anything. Why?
I can achieve this in dplyr with:
flights %>%
filter(carrier=="AA") %>%
group_by(origin, dest, month) %>%
summarize(Arrival_Delay=mean(arr_delay), Depart_Delay=mean(dep_delay)) %>%
as.data.table()
I'm curious how to do this in data.table and why my approach did not work.
Upvotes: 4
Views: 1288
Reputation: 6542
With data.table
, you can use keyby
when you want to sort a result after a by operation. From help("data.table", package = "data.table")
keyby
Same as by, but with an additional setkey() run on the by columns of the result, for convenience. It is common practice to use 'keyby=' routinely when you wish the result to be sorted.
You could then use keyby
instead of by
in your code
library(data.table)
# get the data from the web
flights <- fread("https://github.com/arunsrinivasan/flights/wiki/NYCflights14/flights14.csv")
# keyby will call setkey and then sort your result
ans <- flights[carrier == "AA",
.(mean(arr_delay), mean(dep_delay)),
keyby = .(origin, dest, month)]
ans
#> origin dest month V1 V2
#> 1: EWR DFW 1 6.427673 10.0125786
#> 2: EWR DFW 2 10.536765 11.3455882
#> 3: EWR DFW 3 12.865031 8.0797546
#> 4: EWR DFW 4 17.792683 12.9207317
#> 5: EWR DFW 5 18.487805 18.6829268
#> ---
#> 196: LGA PBI 1 -7.758621 0.3103448
#> 197: LGA PBI 2 -7.865385 2.4038462
#> 198: LGA PBI 3 -5.754098 3.0327869
#> 199: LGA PBI 4 -13.966667 -4.7333333
#> 200: LGA PBI 5 -10.357143 -6.8571429
Upvotes: 4