NelnewR

Reputation: 131

mean() and sum()/n() results do not match

I am working through R for Data Science exercises to teach myself R, and when trying to find different solutions to the same questions, I ran into a result that puzzled me.

I loaded the following packages:

library(nycflights13)
library(tidyverse)

The question is: Look at the number of cancelled flights per day (flights data set). Is there a pattern? Is the proportion of cancelled flights related to the average delay?

I found a solution that describes the pattern well:

flights %>%
  group_by(year, month, day) %>%
  summarize(
    cancelled = mean(is.na(arr_delay)),
    avg_delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  ggplot(mapping = aes(x = avg_delay, y = cancelled)) +
  geom_point(alpha = 0.5) +
  geom_smooth(se = FALSE)

The following code (replacing mean() with sum()/n() for the cancelled flights) gives exactly the same picture:

flights %>%
  group_by(year, month, day) %>%
  summarize(
    cancelled = sum(is.na(arr_delay)) / n(),
    avg_delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  ggplot(mapping = aes(x = avg_delay, y = cancelled)) +
  geom_point(alpha = 0.5) +
  geom_smooth(se = FALSE)

But when I do the same for avg_delay, the picture changes:

flights %>%
  group_by(year, month, day) %>%
  summarize(
    cancelled = sum(is.na(arr_delay)) / n(),
    avg_delay = sum(arr_delay, na.rm = TRUE) / n()
  ) %>%
  ggplot(mapping = aes(x = avg_delay, y = cancelled)) +
  geom_point(alpha = 0.5) +
  geom_smooth(se = FALSE)

I would have expected all three versions to give the same result. My guess is that the missing values are counted in some cases and not in others, which changes the picture, but I lack the R knowledge to test this. Can anyone advise how I can find out where the difference comes from?

Upvotes: 2

Views: 139

Answers (1)

erocoar

Reputation: 5893

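This is because mean() with na.rm = TRUE only considers the values that are not NA, so it divides by the number of non-missing values. n(), on the other hand, counts every row in the group, including the cancelled flights where arr_delay is NA. Whenever a group contains NAs, the two denominators differ.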

For example, in your last snippet, divide by the number of non-missing values instead of n():

avg_delay = sum(arr_delay, na.rm = TRUE)/sum(!is.na(arr_delay))

This yields exactly the same avg_delay as mean(arr_delay, na.rm = TRUE) in your first two snippets.
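To see the difference in isolation, here is a minimal base R sketch on a made-up vector (the values are hypothetical and not taken from the flights data; n_all simply stands in for what n() would return inside summarize()):

```r
# Hypothetical toy data: five flights, two of them cancelled (NA delay).
arr_delay <- c(10, 20, NA, 30, NA)

n_all <- length(arr_delay)        # stands in for n(): counts all rows
n_obs <- sum(!is.na(arr_delay))   # counts only the non-missing values

# mean() with na.rm = TRUE drops the NAs before dividing.
mean_narm <- mean(arr_delay, na.rm = TRUE)

# Dividing the NA-free sum by the full row count gives a smaller value.
sum_over_n <- sum(arr_delay, na.rm = TRUE) / n_all

# Dividing by the non-missing count reproduces mean(..., na.rm = TRUE).
sum_over_obs <- sum(arr_delay, na.rm = TRUE) / n_obs

# is.na() never returns NA, so both forms agree for the cancelled share.
cancelled_mean <- mean(is.na(arr_delay))
cancelled_sum  <- sum(is.na(arr_delay)) / n_all

print(c(mean_narm = mean_narm, sum_over_n = sum_over_n, sum_over_obs = sum_over_obs))
print(c(cancelled_mean = cancelled_mean, cancelled_sum = cancelled_sum))
```

This is also why the cancelled column did not change when you swapped mean() for sum()/n(): is.na(arr_delay) has no missing values, so both versions divide by the same count.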

Upvotes: 4
