Michael Visconti

Reputation: 65

Trying to get count per month instead of total count

I am trying to get the number of canceled flights per month alongside the month and mean arrival delay columns, but instead I can only seem to get the total number of canceled flights repeated next to every month.

Here is my code:

library(dplyr)          # provides %>%, group_by(), summarise(), count(), filter()
library(nycflights13)
flights = nycflights13::flights

flights %>% select(arr_delay,month,dep_time) %>%
   group_by(month) %>% 
   summarise(Mean = mean(arr_delay, na.rm = TRUE), canceled = count(filter(flights, is.na(dep_time))))
   month   Mean canceled$n
   <int>  <dbl>      <int>
 1     1  6.13        8255
 2     2  5.61        8255
 3     3  5.81        8255
 4     4 11.2         8255
 5     5  3.52        8255
 6     6 16.5         8255
 7     7 16.7         8255
 8     8  6.04        8255
 9     9 -4.02        8255
10    10 -0.167       8255
11    11  0.461       8255
12    12 14.9         8255

Upvotes: 0

Views: 263

Answers (1)

r2evans

Reputation: 160447

By calling filter(flights, ..) within summarise() on a grouped frame, you're looking at the entire flights frame, not the data present within the current group. That is why every month shows the same total.

I suggest

flights %>%
   select(arr_delay,month,dep_time) %>%
   group_by(month) %>% 
   summarise(
     Mean = mean(arr_delay, na.rm = TRUE),
     canceled = sum(is.na(dep_time))
   )
# # A tibble: 12 x 3
#    month   Mean canceled
#    <int>  <dbl>    <int>
#  1     1  6.13       521
#  2     2  5.61      1261
#  3     3  5.81       861
#  4     4 11.2        668
#  5     5  3.52       563
#  6     6 16.5       1009
#  7     7 16.7        940
#  8     8  6.04       486
#  9     9 -4.02       452
# 10    10 -0.167      236
# 11    11  0.461      233
# 12    12 14.9       1025

This follows the same rationale as not reusing the original frame name inside a pipe. For instance, if we try

mtcars %>%
  filter(disp > 350) %>%
  summarize(
    mu1 = mean(mtcars$mpg),
    mu2 = mean(mpg)
  )
#        mu1      mu2
# 1 20.09062 14.78571

mtcars originally has 32 rows, but only 7 rows after filter(disp > 350). For the calculation of mu1, we are reaching out to look at the original mtcars, all 32 rows of it; for mu2, we are only looking at the rows present in the data at that point in time, only 7 rows in this example.

So anytime you start a pipe with an object, the only reason you should ever use that object name again in a dplyr verb is if you intentionally want to look at the original state of the frame. In your case, I think you did not; you needed to look at the grouped/filtered data at that point in the pipeline.
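
To see both behaviours side by side without needing nycflights13, here is a minimal sketch using the built-in airquality data. It assumes that NA values in Ozone can stand in for the missing dep_time values in your question; the names per_month, missing_group and missing_total are only illustrative.

```r
library(dplyr)

# airquality ships with base R; its Ozone column contains NAs,
# so it stands in here for dep_time in the flights data (an assumption
# made purely for illustration).
per_month <- airquality %>%
  group_by(Month) %>%
  summarise(
    # bare column name: evaluated within the current group only
    missing_group = sum(is.na(Ozone)),
    # original frame name: reaches back to the full, ungrouped frame
    missing_total = sum(is.na(airquality$Ozone))
  )

per_month
```

missing_group should differ from month to month, while missing_total repeats the same whole-frame count on every row, which is the pattern you saw with 8255.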

Upvotes: 1
