Daniel
Daniel

Reputation: 435

Summarise but keep length variable (dplyr)

Basic dplyr question... Respondents could select multiple companies that they use. For example:

library(dplyr)
test <- tibble(
 CompanyA = rep(c(0:1),5),
 CompanyB = rep(c(1),10),
 CompanyC = c(1,1,1,1,0,0,1,1,1,1)
)
test

If it were a forced-choice question - i.e., respondents could make only one selection - I would do the following for a basic summary table:

test %>% 
  summarise_all(funs(sum), na.rm = TRUE) %>% 
  gather(Response, n) %>% 
  arrange(desc(n)) %>% 
  mutate("%" = round(100*n/sum(n)))

Note, however, that the "%" column is not what I want. I'm instead looking for the proportion of total respondents for each individual response option (since they could make multiple selections).

I've tried adding mutate(totalrows = nrow(.)) %>% prior to the summarise_all command. This would allow me to use that variable as the denominator in a later mutate command. However, summarise_all eliminates the "totalrows" var.

Also, if there's a better way to do this, I'm open to ideas.

Upvotes: 1

Views: 425

Answers (3)

tblznbits
tblznbits

Reputation: 6778

To get the proportion of respondents who chose an option when that variable is binary, you can take the mean. To do this with your test data, you can use sapply:

sapply(test, mean)
CompanyA CompanyB CompanyC 
     0.5      1.0      0.8 

If you wanted to do this in a more complicated fashion (say your data is not binary encoded, but is stored as 1 and 2 instead), you could do that with the following:

test %>% 
    gather(key='Company') %>% 
    group_by(Company) %>% 
    summarise(proportion = sum(value == 1) / n())

# A tibble: 3 x 2
  Company  proportion
  <chr>         <dbl>
1 CompanyA        0.5
2 CompanyB        1  
3 CompanyC        0.8

Upvotes: 3

struggles
struggles

Reputation: 865

If you put all functions in a list within summarise, then this will work. You'll need to do some quick tidying up after though.

test %>% 
  summarise_all(
    list(
      rows = length,
      n = function(x){sum(x, na.rm = T)},
      perc = function(x){sum(x,na.rm = T)/length(x)}
    )) %>%
  tidyr::gather(Response, n) %>%
  tidyr::separate(Response, c("Company", "Metric"), '_') %>%
  tidyr::spread(Metric, n)

And you'll get this

  Company      n  perc  rows
  <chr>    <dbl> <dbl> <dbl>
1 CompanyA     5   0.5    10
2 CompanyB    10   1      10
3 CompanyC     8   0.8    10

Upvotes: 1

dave-edison
dave-edison

Reputation: 3726

Here is a solution using tidyr::gather:

test %>% 
  gather(Company, response) %>% 
  group_by(Company) %>% 
  summarise(`%` = 100 * sum(response) / n())

Upvotes: 0

Related Questions