Reputation: 131
I am working through R for Data Science exercises to teach myself R, and when trying to find different solutions to the same questions, I ran into a result that puzzled me.
I loaded the following packages:
library(nycflights13)
library(tidyverse)
The question is: Look at the number of cancelled flights per day (flights data set). Is there a pattern? Is the proportion of cancelled flights related to the average delay?
I found a solution that describes the pattern well:
flights %>%
  group_by(year, month, day) %>%
  summarize(cancelled = mean(is.na(arr_delay)), avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
  ggplot(mapping = aes(x = avg_delay, y = cancelled)) +
  geom_point(alpha = 0.5) +
  geom_smooth(se = FALSE)
The following code (exchanging mean() with sum()/n() for cancelled flights) gives exactly the same picture:
flights %>%
  group_by(year, month, day) %>%
  summarize(cancelled = sum(is.na(arr_delay)) / n(), avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
  ggplot(mapping = aes(x = avg_delay, y = cancelled)) +
  geom_point(alpha = 0.5) +
  geom_smooth(se = FALSE)
But when I do the same for avg_delay, the picture changes:
flights %>%
  group_by(year, month, day) %>%
  summarize(cancelled = sum(is.na(arr_delay)) / n(), avg_delay = sum(arr_delay, na.rm = TRUE) / n()) %>%
  ggplot(mapping = aes(x = avg_delay, y = cancelled)) +
  geom_point(alpha = 0.5) +
  geom_smooth(se = FALSE)
I would have expected all three versions to give the same result. My guess is that the missing values are counted in some cases and not in others, which changes the picture, but I lack the R knowledge to test for the difference. Can anyone advise how I can find out where the difference comes from?
Upvotes: 2
Views: 139
Reputation: 5893
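This is because mean() with na.rm = TRUE only considers the values that are not NA, so it divides by the number of non-missing values. That count is not equal to n(), which counts every row in the group, including the cancelled flights!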
For example, in your last snippet, use
avg_delay = sum(arr_delay, na.rm = TRUE)/sum(!is.na(arr_delay))
This yields exactly the same result as the first two versions.
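To see where the difference comes from without plotting, here is a minimal base R sketch on a made-up vector (the numbers and variable names are purely illustrative, not taken from the flights data). It compares the denominators each approach uses for a single "day":

```r
# Toy arrival delays for one day; NA stands for a cancelled flight (made-up numbers).
arr_delay <- c(10, 20, NA, 30, NA)

n_all   <- length(arr_delay)        # what n() would return for this group
n_valid <- sum(!is.na(arr_delay))   # flights that actually have an arrival delay

# Average delay: three ways of computing it.
mean_narm   <- mean(arr_delay, na.rm = TRUE)            # divides by n_valid
sum_over_n  <- sum(arr_delay, na.rm = TRUE) / n_all     # divides by all rows
sum_over_ok <- sum(arr_delay, na.rm = TRUE) / n_valid   # matches mean_narm

# Proportion cancelled: is.na() never returns NA, so both forms divide by n_all.
cancelled_mean <- mean(is.na(arr_delay))
cancelled_sum  <- sum(is.na(arr_delay)) / n_all

print(c(mean_narm = mean_narm, sum_over_n = sum_over_n, sum_over_ok = sum_over_ok))
print(c(cancelled_mean = cancelled_mean, cancelled_sum = cancelled_sum))
```

Dividing by n() pulls the average towards zero on days with many cancellations, which is why only the avg_delay axis of the plot changes while cancelled stays the same.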
Upvotes: 4