Reputation: 25
This is related to the previous question on grouping/lookups on data.table but with additional output.
I am trying to perform a grouping on the subset .SD and select values from the each subset. For example, in the flights dataset, i would like to find out : For each airport and month, which UniqueCarrier and Destination has the worst avg ArrDelay. So there are essentially 2 levels of aggregation.
I have the working solution as follows.. But it would be good to understand if there are any better solutions.
library(data.table)
library(hflights)
DT <- as.data.table(hflights)
setkey(DT, Origin, Month)
#The solution code...
DT[, {
t1 <- .SD[, .(mean(na.omit(ArrDelay))) , by=UniqueCarrier];
max1 <- which.max(t1$V1);
t2 <- .SD[, .(mean(na.omit(ArrDelay))) , by=Dest];
max2 <- which.max(t2$V1);
list( MaxAvgDelayForCarrier = t1$UniqueCarrier[max1], MaxAvgDelayByCarrier = t1$V1[max1], MaxAvgDelayByDest= t2$Dest[max2], MaxAvgDelayForDest= t2$V1[max2] )
}, by = .(Origin, Month)]
# Checking for correctness
head(DT[ .("HOU", 1), .(MaxAvgDelayByCarrier=mean(na.omit(ArrDelay))), by=UniqueCarrier][order(-MaxAvgDelayByCarrier)],1)
head(DT[ .("IAH", 2), .(MaxAvgDelayForDest=mean(na.omit(ArrDelay))), by=Dest][order(-MaxAvgDelayForDest)],1)
Upvotes: 1
Views: 60
Reputation: 66819
I think your code is fine, but I would write it this way:
DT[,c(
.SD[,
.(CMaxVal = mean(na.omit(ArrDelay))),
by=.(CMax = UniqueCarrier)][which.max(CMaxVal)],
.SD[,
.(DMaxVal = mean(na.omit(ArrDelay))),
by=.(DMax = Dest)][which.max(DMaxVal)]
),by=key(DT)]
which gives
Origin Month CMax CMaxVal DMax DMaxVal
1: HOU 1 F9 13.725806 PHL 20.12500
2: HOU 2 B6 17.822222 ECP 20.17308
3: HOU 3 EV 23.088889 PHL 46.06452
4: HOU 4 EV 27.847826 PHL 67.93333
5: HOU 5 EV 25.436620 PHL 75.61290
6: HOU 6 EV 16.930233 EWR 34.87755
7: HOU 7 B6 20.016129 CHS 21.54839
8: HOU 8 B6 30.163636 JFK 30.16364
9: HOU 9 DL 18.625000 EWR 14.32143
10: HOU 10 DL 17.803279 PHL 22.51613
11: HOU 11 F9 3.000000 EWR 18.46429
12: HOU 12 MQ 13.554502 EWR 28.17857
13: IAH 1 EV 15.682353 HNL 21.52632
14: IAH 2 MQ 19.946809 BPT 29.00000
15: IAH 3 AS 15.354839 SFO 27.43590
16: IAH 4 MQ 16.263441 SEA 22.48515
17: IAH 5 MQ 25.179104 DAY 25.96154
18: IAH 6 UA 24.453125 ANC 34.06667
19: IAH 7 OO 15.117419 DSM 32.39286
20: IAH 8 UA 17.297561 ANC 37.96552
21: IAH 9 UA 11.620000 SJU 16.76923
22: IAH 10 UA 11.601266 CID 16.88462
23: IAH 11 MQ 8.445545 CID 18.04167
24: IAH 12 XE 11.376852 HOB 25.95556
Origin Month CMax CMaxVal DMax DMaxVal
There's no need to store so many intermediate objects (t1
, max1
, etc.).
The approach above requires manually coding for each grouping variable. You could instead do...
DT[,do.call(c,lapply(c("UniqueCarrier","Dest"),
function(colname) setNames(
.SD[,mean(na.omit(ArrDelay)),by=c(colname)][which.max(V1)],
paste0(c("max_","maxval_"),colname)
)
)),by=key(DT)]
Upvotes: 3