stackinator

Reputation: 5819

dplyr Summarise improperly excluding NA

We can group mtcars by cylinder and summarize miles per gallon with some simple code.

library(dplyr)
mtcars %>% 
  group_by(cyl) %>% 
  summarise(avg = mean(mpg))

This provides the correct output shown below.

    cyl      avg
1     4 26.66364
2     6 19.74286
3     8 15.10000

If I kindly ask dplyr to exclude NA I get some weird results.

mtcars %>% 
  group_by(cyl) %>% 
  summarise(avg = mean(!is.na(mpg)))

Since there are no NA in this data set the results should be the same as above. But it averages all mpg to exactly "1". A problem with my code or a bug in dplyr?

    cyl   avg
1     4     1
2     6     1
3     8     1

My actual data set does have some NA that I need to exclude only for this summarization, but exhibits the same behavior.

Upvotes: 0

Views: 396

Answers (2)

InfiniteFlash

Reputation: 1058

You want this:

mtcars %>% 
  group_by(cyl) %>% 
  summarise(avg = mean(mpg, na.rm = TRUE))

# A tibble: 3 x 2
    cyl      avg
  <dbl>    <dbl>
1     4 26.66364
2     6 19.74286
3     8 15.10000

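Right now, !is.na(mpg) returns a logical vector rather than the mpg values. When you take the mean() of a logical vector, TRUE is coerced to 1 and FALSE to 0, so you get the proportion of non-NA values, not the average you want. Since mtcars has no NA in mpg, that proportion is exactly 1 for every group.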
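To see the coercion concretely, here is a minimal sketch in base R. The vector x is a made-up example (not from mtcars) with one missing value, so the two calls give visibly different answers.

```r
# Hypothetical vector with one missing value, for illustration only
x <- c(21.0, NA, 22.8, 18.1)

# !is.na() gives a logical vector: TRUE FALSE TRUE TRUE
is_present <- !is.na(x)

# mean() treats TRUE as 1 and FALSE as 0, so this is the share of non-NA values
prop_present <- mean(is_present)

# na.rm = TRUE drops the NA and averages the remaining numbers
avg <- mean(x, na.rm = TRUE)

print(prop_present)
print(avg)
```

The first value is the fraction of entries that are not NA (three out of four here), while the second is the average of the three remaining numbers.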

Upvotes: 5

Sun Bee

Reputation: 1820

The way you have coded it, the input to the mean() function is a vector of TRUE and FALSE values. Use mean(mpg[!is.na(mpg)]) instead.
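As a rough sketch of how that looks inside the original dplyr pipeline: mt_na below is a hypothetical copy of mtcars with two mpg values set to NA (my own modification, just so there is something to exclude). Subsetting with !is.na() and passing na.rm = TRUE should agree.

```r
library(dplyr)

# Hypothetical copy of mtcars with two mpg values blanked out for illustration
mt_na <- mtcars
mt_na$mpg[c(1, 5)] <- NA

res <- mt_na %>%
  group_by(cyl) %>%
  summarise(
    avg_subset = mean(mpg[!is.na(mpg)]),   # drop NA by subsetting mpg itself
    avg_na_rm  = mean(mpg, na.rm = TRUE),  # drop NA via the na.rm argument
    n_missing  = sum(is.na(mpg))           # count of values excluded per group
  )

print(res)
```

The n_missing column is optional, but it is a cheap way to check how many values were dropped from each group.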

You could also use data.table, which I have used here for illustration. Since mtcars has no NA in mpg, the following all produce the same result.

library(data.table)
MT <- as.data.table(mtcars)  # MT was not defined above; assumed to be mtcars as a data.table
MT[, mean(mpg), by = cyl]
   cyl       V1
1:   6 19.74286
2:   4 26.66364
3:   8 15.10000

MT[, mean(mpg, na.rm=TRUE), by = cyl]
   cyl       V1
1:   6 19.74286
2:   4 26.66364
3:   8 15.10000

MT[, mean(mpg[!is.na(mpg)]), by = cyl]
   cyl       V1
1:   6 19.74286
2:   4 26.66364
3:   8 15.10000

Upvotes: 0
