Reputation: 505
I am not sure if this is possible. I want to use summarise to count all the rows that have NA in every column besides the group_by column. I am able to do it by writing out all five conditions in NO_OL_Percent, connecting each column with &. If you can do it in SQL, I would think you could do it with dplyr or purrr, but it seems no one on the internet has tried this.
The data can be downloaded here.
My code is below. It works, but is there really no way to use an all-style function for the last line? I need to do a group_by first, so I cannot use filter_all in dplyr.
library(dplyr)

farmers_market = read.csv("Export.csv", stringsAsFactors = F, na.strings = c("NA", "NaN", ""))

farmers_market %>%
  select(c("Website", "Facebook", "Twitter", "Youtube", "OtherMedia", "State")) %>%
  group_by(State) %>%
  summarise(Num_Markets = n(),
            FB_Percent = 100 - 100*sum(is.na(Facebook))/n(),
            TW_Percent = 100 - 100*sum(is.na(Twitter))/n(),
            #fb = sum(is.na(Facebook)),
            OL_Percent = 100 - 100*sum(is.na(Facebook) & is.na(Twitter))/n(),
            NO_OL_Percent = 100 - 100*sum(is.na(Facebook) & is.na(Twitter) & is.na(Website) & is.na(Youtube) & is.na(OtherMedia))/n()
  )
Upvotes: 1
Views: 1338
Reputation: 1064
A straightforward way to get the Percent columns would be:
farmers_market %>%
  select("Website", "Facebook", "Twitter", "Youtube", "OtherMedia", "State") %>%
  group_by(State) %>%
  summarise_all(funs("Percent" = sum(is.na(.))/n()))
# A tibble: 53 x 6
# State Website_Percent Facebook_Percent Twitter_Percent Youtube_Percent OtherMedia_Percent
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Alabama 0.727 0.741 0.942 0.993 0.964
#2 Alaska 0.447 0.579 0.895 1 0.974
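Note that funs() has since been deprecated; in dplyr >= 1.0 the same per-column summary can be written with across(). A sketch, assuming the same farmers_market data frame:

```r
library(dplyr)

# NA proportion per social-media column, by State
farmers_market %>%
  select(Website, Facebook, Twitter, Youtube, OtherMedia, State) %>%
  group_by(State) %>%
  summarise(across(everything(), ~ sum(is.na(.x)) / n(), .names = "{.col}_Percent"))
```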
To add the num_markets column, one option is:
farmers_market %>%
  select("Website", "Facebook", "Twitter", "Youtube", "OtherMedia", "State") %>%
  group_by(State) %>%
  mutate(num_markets = n()) %>%
  group_by(State, num_markets) %>%
  summarise_all(funs("Percent" = sum(is.na(.))/n()))
# A tibble: 53 x 7
# Groups: State [2]
# State num_markets Website_Percent Facebook_Percent Twitter_Percent Youtube_Percent OtherMedia_Percent
# <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Alabama 139 0.727 0.741 0.942 0.993 0.964
#2 Alaska 38 0.447 0.579 0.895 1 0.974
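With dplyr >= 1.0 the intermediate mutate/regroup step is no longer needed, because n() and across() can sit in the same summarise() call. A sketch under the same assumptions:

```r
library(dplyr)

farmers_market %>%
  group_by(State) %>%
  summarise(num_markets = n(),
            across(c(Website, Facebook, Twitter, Youtube, OtherMedia),
                   ~ sum(is.na(.x)) / n(),
                   .names = "{.col}_Percent"))
```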
Upvotes: 0
Reputation: 389235
I removed the select statement: since we are summarising, only the relevant columns are kept anyway. I created a cols vector of the columns in which we want to count NAs.
We first check, for every row, whether all the cols columns are NA and assign the resulting TRUE/FALSE to a new column all_NA. We then group_by State and perform the calculations for the rest of the columns as before, but for NO_OL_Percent we sum all_NA to get the number of all-NA rows per group and divide by the total number of rows in the group.
library(dplyr)

cols <- c("Website", "Facebook", "Twitter", "Youtube", "OtherMedia")

farmers_market %>%
  mutate(all_NA = rowSums(is.na(.[cols])) == length(cols)) %>%
  group_by(State) %>%
  summarise(Num_Markets = n(),
            FB_Percent = 100 - 100*sum(is.na(Facebook))/n(),
            TW_Percent = 100 - 100*sum(is.na(Twitter))/n(),
            OL_Percent = 100 - 100*sum(is.na(Facebook) & is.na(Twitter))/n(),
            NO_OL_Percent = 100 - 100*sum(all_NA)/n())
# State Num_Markets FB_Percent TW_Percent OL_Percent NO_OL_Percent
# <chr> <int> <dbl> <dbl> <dbl> <dbl>
# 1 Alabama 139 25.9 5.76 25.9 37.4
# 2 Alaska 38 42.1 10.5 42.1 65.8
# 3 Arizona 92 57.6 27.2 57.6 80.4
# 4 Arkansas 111 52.3 4.50 52.3 61.3
# 5 California 759 41.5 14.5 43.2 70.1
# 6 Colorado 161 44.1 9.94 44.1 82.6
# 7 Connecticut 157 33.8 12.1 33.8 53.5
# 8 Delaware 36 61.1 11.1 61.1 83.3
# 9 District of Columbia 57 50.9 43.9 50.9 87.7
#10 Florida 262 43.1 8.78 43.1 83.2
# … with 43 more rows
This gives the same output as your current approach, but without manually writing out all the column names.
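If you are on dplyr >= 1.0.4, the helper column can be skipped entirely: if_all() applies a predicate across a set of columns row by row, so the same NO_OL_Percent can be computed inline. A sketch, assuming the same data and cols vector:

```r
library(dplyr)

cols <- c("Website", "Facebook", "Twitter", "Youtube", "OtherMedia")

farmers_market %>%
  group_by(State) %>%
  summarise(Num_Markets = n(),
            # if_all() is TRUE for a row only when every column in cols is NA
            NO_OL_Percent = 100 - 100 * sum(if_all(all_of(cols), is.na)) / n())
```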
Upvotes: 1