Reputation: 505
I am not sure if this is possible. I want to use summarise to count all the rows that have NA in every column besides the group_by column. I am able to do it by writing out all five conditions in NO_OL_Percent, connecting each column with &. If you can do it in SQL, I would think you could do it with dplyr or purrr, but it seems no one on the internet has tried this.
The data can be downloaded here.
My code is below. It works, but is there really no way to use an all-style function for the last line? I need to do a group_by first, so I cannot use filter_all in dplyr.
library(dplyr)

farmers_market = read.csv("Export.csv", stringsAsFactors = F, na.strings = c("NA", "NaN", ""))

farmers_market %>%
  select(c("Website", "Facebook", "Twitter", "Youtube", "OtherMedia", "State")) %>%
  group_by(State) %>%
  summarise(Num_Markets = n(),
            FB_Percent = 100 - 100*sum(is.na(Facebook))/n(),
            TW_Percent = 100 - 100*sum(is.na(Twitter))/n(),
            #fb = sum(is.na(Facebook)),
            OL_Percent = 100 - 100*sum(is.na(Facebook) & is.na(Twitter))/n(),
            NO_OL_Percent = 100 - 100*sum(is.na(Facebook) & is.na(Twitter) & is.na(Website) & is.na(Youtube) & is.na(OtherMedia))/n()
  )
Upvotes: 1
Views: 1338
Reputation: 1064
A straightforward way to get the Percent columns would be:
farmers_market %>%
  select("Website", "Facebook", "Twitter", "Youtube", "OtherMedia", "State") %>%
  group_by(State) %>%
  summarise_all(funs("Percent" = sum(is.na(.))/n()))
# A tibble: 53 x 6
# State Website_Percent Facebook_Percent Twitter_Percent Youtube_Percent OtherMedia_Percent
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Alabama 0.727 0.741 0.942 0.993 0.964
#2 Alaska 0.447 0.579 0.895 1 0.974
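Note that funs() has since been deprecated; in dplyr >= 1.0 the same per-column summary can be written with across(). A sketch, assuming the same farmers_market data frame:

```r
library(dplyr)

# NA proportion per social-media column, by State
farmers_market %>%
  select(Website, Facebook, Twitter, Youtube, OtherMedia, State) %>%
  group_by(State) %>%
  summarise(across(everything(), ~ sum(is.na(.x)) / n(), .names = "{.col}_Percent"))
```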
To add the num_markets column, one option is:
farmers_market %>%
  select("Website", "Facebook", "Twitter", "Youtube", "OtherMedia", "State") %>%
  group_by(State) %>%
  mutate(num_markets = n()) %>%
  group_by(State, num_markets) %>%
  summarise_all(funs("Percent" = sum(is.na(.))/n()))
# A tibble: 53 x 7
# Groups: State [2]
# State num_markets Website_Percent Facebook_Percent Twitter_Percent Youtube_Percent OtherMedia_Percent
# <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Alabama 139 0.727 0.741 0.942 0.993 0.964
#2 Alaska 38 0.447 0.579 0.895 1 0.974
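With dplyr >= 1.0 the intermediate mutate/regroup step is no longer needed, because n() and across() can sit in the same summarise() call. A sketch under the same assumptions:

```r
library(dplyr)

farmers_market %>%
  group_by(State) %>%
  summarise(num_markets = n(),
            across(c(Website, Facebook, Twitter, Youtube, OtherMedia),
                   ~ sum(is.na(.x)) / n(),
                   .names = "{.col}_Percent"))
```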
Upvotes: 0
Reputation: 389235
I removed the select statement: since we are summarising, only the relevant columns are kept anyway. I created a cols vector of the columns in which we want to count NAs.
We first check, for every row, whether all the cols columns are NA and assign the resulting TRUE/FALSE to a new column all_NA. We then group_by State and perform the calculations for the rest of the columns as before, but for NO_OL_Percent we sum all_NA to get the number of all-NA rows per group and divide by the total number of rows in the group.
library(dplyr)

cols <- c("Website", "Facebook", "Twitter", "Youtube", "OtherMedia")

farmers_market %>%
  mutate(all_NA = rowSums(is.na(.[cols])) == length(cols)) %>%
  group_by(State) %>%
  summarise(Num_Markets = n(),
            FB_Percent = 100 - 100*sum(is.na(Facebook))/n(),
            TW_Percent = 100 - 100*sum(is.na(Twitter))/n(),
            OL_Percent = 100 - 100*sum(is.na(Facebook) & is.na(Twitter))/n(),
            NO_OL_Percent = 100 - 100*sum(all_NA)/n())
# State Num_Markets FB_Percent TW_Percent OL_Percent NO_OL_Percent
# <chr> <int> <dbl> <dbl> <dbl> <dbl>
# 1 Alabama 139 25.9 5.76 25.9 37.4
# 2 Alaska 38 42.1 10.5 42.1 65.8
# 3 Arizona 92 57.6 27.2 57.6 80.4
# 4 Arkansas 111 52.3 4.50 52.3 61.3
# 5 California 759 41.5 14.5 43.2 70.1
# 6 Colorado 161 44.1 9.94 44.1 82.6
# 7 Connecticut 157 33.8 12.1 33.8 53.5
# 8 Delaware 36 61.1 11.1 61.1 83.3
# 9 District of Columbia 57 50.9 43.9 50.9 87.7
#10 Florida 262 43.1 8.78 43.1 83.2
# … with 43 more rows
This gives the same output as your current approach, but without manually writing out all the column names.
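If you are on dplyr >= 1.0.4, the helper column can be skipped entirely: if_all() applies a predicate across a set of columns row by row, so the same NO_OL_Percent can be computed inline. A sketch, assuming the same data and cols vector:

```r
library(dplyr)

cols <- c("Website", "Facebook", "Twitter", "Youtube", "OtherMedia")

farmers_market %>%
  group_by(State) %>%
  summarise(Num_Markets = n(),
            # if_all() is TRUE for a row only when every column in cols is NA
            NO_OL_Percent = 100 - 100 * sum(if_all(all_of(cols), is.na)) / n())
```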
Upvotes: 1