Jeremy K.
Jeremy K.

Reputation: 1792

Taking a count() after group_by() for non-missing values

I have with some data with missing values (i.e. NA values), the simplified format is below (code for input at the end):


#>   id   x country
#> 1  1 2.0     USA
#> 2  2 4.0     USA
#> 3  3 3.5     JPN
#> 4  4  NA     JPN

For each country, I'd like to take the mean of x, and a count of usable values of x (i.e. not NA), so I've used group_by, and it works for the mean:

df <- df %>% group_by(country) %>% 
  mutate(mean_x = mean(x, na.rm = TRUE),
        #count_x = count(x)) 
        )

df
#> # A tibble: 4 x 4
#> # Groups:   country [2]
#>      id     x country mean_x
#>   <dbl> <dbl> <fct>    <dbl>
#> 1     1   2   USA        3  
#> 2     2   4   USA        3  
#> 3     3   3.5 JPN        3.5
#> 4     4  NA   JPN        3.5

but when I try to add the count(), I'm getting an error

library(tidyverse)
df <- data.frame(id = c(1, 2, 3, 4),
                  x = c(2, 4, 3.5, NA),
                  country = c("USA", "USA", "JPN", "JPN")
                 )
df
df <- df %>% group_by(country) %>% 
  mutate(mean_x = mean(x, na.rm = TRUE),
        count_x = count(x)) 
        )

df

#> Error in UseMethod("summarise_") : no applicable method for 'summarise_' applied to an 
#> object of class "c('double', 'numeric')"

My desired output would be:

#>      id     x country mean_x  count
#>   <dbl> <dbl> <fct>    <dbl>
#> 1     1   2   USA        3     2
#> 2     2   4   USA        3     2
#> 3     3   3.5 JPN        3.5   1
#> 4     4  NA   JPN        3.5   1

Reproducible code below:

library(tidyverse)
df <- data.frame(id = c(1, 2, 3, 4),
                  x = c(2, 4, 3.5, NA),
                  country = c("USA", "USA", "JPN", "JPN")
                 )
df
df <- df %>% group_by(country) %>% 
  mutate(mean_x = mean(x, na.rm = TRUE),
        count_x = count(x)) 
        )

df

Upvotes: 2

Views: 2640

Answers (2)

akrun
akrun

Reputation: 887118

We can also create the 'count' with a group_by n()

library(dplyr)
df %>% 
    group_by(country) %>% 
    mutate(mean_x = mean(x, na.rm = TRUE)) %>%
    summarise(n = n())
# A tibble: 2 x 2
#  country     n
#  <fct>   <int>
#1 JPN         2
#2 USA         2

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 388982

count is not the right function here. The first argument to count is a dataframe or tibble specifically. However, what you are passing is a vector hence you get the error. Also count summarises the dataframe so that you have only one row per group. See for example,

library(dplyr)

df %>% 
  group_by(country) %>% 
  mutate(mean_x = mean(x, na.rm = TRUE)) %>%
  count(country)

#  country     n
#  <fct>   <int>
#1 JPN         2
#2 USA         2

If you want to add a new column without summarising, use add_count instead

df %>% 
  group_by(country) %>% 
  mutate(mean_x = mean(x, na.rm = TRUE)) %>%
  add_count(country)

#     id     x country mean_x     n
#  <dbl> <dbl> <fct>    <dbl> <int>
#1     1   2   USA        3       2
#2     2   4   USA        3       2
#3     3   3.5 JPN        3.5     2
#4     4  NA   JPN        3.5     2

However, both of this function don't do what you need. To count non-NA values per group, you need

df %>% 
  group_by(country) %>% 
  mutate(mean_x = mean(x, na.rm = TRUE), 
         count = length(na.omit(x)))
         #OR
         #count = sum(!is.na(x)))#as @Humpelstielzchen mentioned


#    id     x country mean_x count
#  <dbl> <dbl> <fct>    <dbl> <int>
#1     1   2   USA        3       2
#2     2   4   USA        3       2
#3     3   3.5 JPN        3.5     1
#4     4  NA   JPN        3.5     1

Upvotes: 1

Related Questions