danpo
danpo

Reputation: 35

Clean way to calculate both group and overall statistics

I would like like to calculate the median not only for different groups of my data, but also the median over all groups and store the result in a single data.frame. While accomplishing each of these tasks separately is easy, I have not found a clean way to do both at the same time.

Right now, what I'm doing is calculate both statistics separately; then join the results; then tidy the data if necessary. Here's an example of what this may look like if I wanted to know the median delay per day and per month:

library(dplyr)
library(hflights)

data(hflights)

# Calculate both statistics separately
per_day <- hflights %>% 
  group_by(Year, Month, DayofMonth) %>%
  summarise(Delay = mean(ArrDelay, na.rm = TRUE)) %>%
  mutate(Interval = "Daily")

per_month <- hflights %>% 
  group_by(Year, Month) %>%
  summarise(Delay = mean(ArrDelay, na.rm = TRUE)) %>%
  mutate(Interval = "Monthly", DayofMonth = NA)

# Join into a single data.frame
my_summary <- full_join(per_day, per_month,
                        by = c("Year", "Month", "DayofMonth", "Interval", "Delay"))

my_summary
# Source: local data frame [377 x 5]
# Groups: Year, Month
# 
# Year Month DayofMonth     Delay Interval
# 1  2011     1          1 10.067642    Daily
# 2  2011     1          2 10.509745    Daily
# 3  2011     1          3  6.038627    Daily
# 4  2011     1          4  7.970740    Daily
# 5  2011     1          5  4.172650    Daily
# 6  2011     1          6  6.069909    Daily
# 7  2011     1          7  3.907295    Daily
# 8  2011     1          8  3.070140    Daily
# 9  2011     1          9 17.254325    Daily
# 10 2011     1         10 11.040388    Daily
# ..  ...   ...        ...       ...      ...

Are there better ways to do this?

(Note that in many cases one could easily progressively roll up summaries as pointed out in the Introduction to dplyr. However, this doesn't work for statistics like median, mean etc.)

Upvotes: 2

Views: 80

Answers (1)

Frank
Frank

Reputation: 66819

As a one-off table. This is fairly straightforward in data.table:

require(data.table)

setDT(hflights)[,{
  mo_del <- mean(ArrDelay,na.rm=TRUE)
  .SD[,.(DailyDelay = mean(ArrDelay,na.rm=TRUE),MonthlyDelay = mo_del),by=DayofMonth]
},by=.(Year,Month)]
  #    Year Month DayofMonth DailyDelay MonthlyDelay
  # 1: 2011     1          1 10.0676417     4.926065
  # 2: 2011     1          2 10.5097451     4.926065
  # 3: 2011     1          3  6.0386266     4.926065
  # 4: 2011     1          4  7.9707401     4.926065
  # 5: 2011     1          5  4.1726496     4.926065
 # ---                                              
# 361: 2011    12         14  1.0293610     5.013244
# 362: 2011    12         17 -0.1049822     5.013244
# 363: 2011    12         24 -4.1457490     5.013244
# 364: 2011    12         25 -2.2976827     5.013244
# 365: 2011    12         31 46.4846491     5.013244

How it works. The basic syntax is DT[i,j,by].

  • With by=.(Year,Month), all operations in j are done per "by group."
  • We can nest another "by group" using the data.table of the current Subset of Data, .SD.
  • To return columns in j we use .(colname1=col1,colname2=col2,...).

Creating new variables. Alternately, we could create new variables in hflights using := in j.

hflights[,DailyDelay   := mean(ArrDelay,na.rm=TRUE),.(Year,Month,DayofMonth)]
hflights[,MonthlyDelay := mean(ArrDelay,na.rm=TRUE),.(Year,Month)]

Then we can view the summary table:

hflights[,.GRP,.(Year,Month,DayofMonth,DailyDelay,MonthlyDelay)]
  #    Year Month DayofMonth DailyDelay MonthlyDelay .GRP
  # 1: 2011     1          1 10.0676417     4.926065    1
  # 2: 2011     1          2 10.5097451     4.926065    2
  # 3: 2011     1          3  6.0386266     4.926065    3
  # 4: 2011     1          4  7.9707401     4.926065    4
  # 5: 2011     1          5  4.1726496     4.926065    5
 # ---                                                   
# 361: 2011    12         14  1.0293610     5.013244  361
# 362: 2011    12         17 -0.1049822     5.013244  362
# 363: 2011    12         24 -4.1457490     5.013244  363
# 364: 2011    12         25 -2.2976827     5.013244  364
# 365: 2011    12         31 46.4846491     5.013244  365

(Something needed to be put in j here, so I used the "by group" code, .GRP.)

Upvotes: 3

Related Questions