How to perform further grouping and lookups on aggregated values in .SD for a data.table -

Question

This is related to the previous question on grouping/lookups on data.table but with additional output.

I am trying to perform a grouping on the subset .SD and select values from the each subset. For example, in the flights dataset, i would like to find out : For each airport and month, which UniqueCarrier and Destination has the worst avg ArrDelay. So there are essentially 2 levels of aggregation.

I have the working solution as follows.. But it would be good to understand if there are any better solutions.

library(data.table)
library(hflights)

DT <- as.data.table(hflights)

setkey(DT, Origin, Month)

#The solution code...
DT[, {
 t1 <- .SD[, .(mean(na.omit(ArrDelay))) , by=UniqueCarrier];
 max1 <- which.max(t1$V1);
 t2 <- .SD[, .(mean(na.omit(ArrDelay))) , by=Dest];
 max2 <- which.max(t2$V1);
 list( MaxAvgDelayForCarrier = t1$UniqueCarrier[max1], MaxAvgDelayByCarrier = t1$V1[max1],  MaxAvgDelayByDest= t2$Dest[max2], MaxAvgDelayForDest= t2$V1[max2] )
},  by = .(Origin, Month)]

# Checking for correctness
head(DT[ .("HOU", 1), .(MaxAvgDelayByCarrier=mean(na.omit(ArrDelay))), by=UniqueCarrier][order(-MaxAvgDelayByCarrier)],1)
head(DT[ .("IAH", 2), .(MaxAvgDelayForDest=mean(na.omit(ArrDelay))), by=Dest][order(-MaxAvgDelayForDest)],1)

Frank · Accepted Answer

I think your code is fine, but I would write it this way:

DT[,c(

  .SD[, 
    .(CMaxVal = mean(na.omit(ArrDelay))),
  by=.(CMax = UniqueCarrier)][which.max(CMaxVal)],

  .SD[, 
    .(DMaxVal = mean(na.omit(ArrDelay))),
  by=.(DMax = Dest)][which.max(DMaxVal)]

),by=key(DT)]

which gives

    Origin Month CMax   CMaxVal DMax  DMaxVal
 1:    HOU     1   F9 13.725806  PHL 20.12500
 2:    HOU     2   B6 17.822222  ECP 20.17308
 3:    HOU     3   EV 23.088889  PHL 46.06452
 4:    HOU     4   EV 27.847826  PHL 67.93333
 5:    HOU     5   EV 25.436620  PHL 75.61290
 6:    HOU     6   EV 16.930233  EWR 34.87755
 7:    HOU     7   B6 20.016129  CHS 21.54839
 8:    HOU     8   B6 30.163636  JFK 30.16364
 9:    HOU     9   DL 18.625000  EWR 14.32143
10:    HOU    10   DL 17.803279  PHL 22.51613
11:    HOU    11   F9  3.000000  EWR 18.46429
12:    HOU    12   MQ 13.554502  EWR 28.17857
13:    IAH     1   EV 15.682353  HNL 21.52632
14:    IAH     2   MQ 19.946809  BPT 29.00000
15:    IAH     3   AS 15.354839  SFO 27.43590
16:    IAH     4   MQ 16.263441  SEA 22.48515
17:    IAH     5   MQ 25.179104  DAY 25.96154
18:    IAH     6   UA 24.453125  ANC 34.06667
19:    IAH     7   OO 15.117419  DSM 32.39286
20:    IAH     8   UA 17.297561  ANC 37.96552
21:    IAH     9   UA 11.620000  SJU 16.76923
22:    IAH    10   UA 11.601266  CID 16.88462
23:    IAH    11   MQ  8.445545  CID 18.04167
24:    IAH    12   XE 11.376852  HOB 25.95556
    Origin Month CMax   CMaxVal DMax  DMaxVal

There's no need to store so many intermediate objects (t1, max1, etc.).

The approach above requires manually coding for each grouping variable. You could instead do...

DT[,do.call(c,lapply(c("UniqueCarrier","Dest"),
  function(colname) setNames(
    .SD[,mean(na.omit(ArrDelay)),by=c(colname)][which.max(V1)],
    paste0(c("max_","maxval_"),colname)
  )
)),by=key(DT)]

How to perform further grouping and lookups on aggregated values in .SD for a data.table -

Answers (1)

Related Questions