Nicosc
Nicosc

Reputation: 323

Calculating percentage for several columns with the same values

I have a dataset in which four variables have the same values. Now I want to calculate the percentage shares for the values in each variable, so I can plot them in a stacked bar chart.

This is an example dataset:

   climate_change            air_quality              water_polution             trash                 
   <chr>                     <chr>                    <chr>                      <chr>                 
 1 Not a very serious probl~ A somewhat serious prob~ A somewhat serious problem A very serious problem
 2 Not a very serious probl~ Not a very serious prob~ Not a very serious problem Not a very serious pr~
 3 NA                        NA                       NA                         NA                    
 4 NA                        NA                       NA                         NA                    
 5 A very serious problem    A very serious problem   A very serious problem     A very serious problem
 6 A somewhat serious probl~ A very serious problem   Not at all a serious prob~ A somewhat serious pr~

I know how to calculate the percentage shares for each variable, e.g.:

lebanon %>%
  filter(!is.na(climate_change)) %>%
  count(climate_change) %>%
  mutate(prop = n / sum(n))

Getting:

  climate_change                   n   prop
  <chr>                        <int>  <dbl>
1 A somewhat serious problem     348 0.286 
2 A very serious problem         620 0.510 
3 Not a very serious problem     202 0.166 
4 Not at all a serious problem    45 0.0370

What I want now is to find a solution for keeping the values listed as rows while having the variables listed as columns with n and/or prop values. What would be the most efficient way to get that?

I want to have something like that:

                             climate_change    air_quality   .....   .....
  <chr>                         <dbl>
1 A somewhat serious problem      0.286           .....
2 A very serious problem          0.510           .....
3 Not a very serious problem      0.166 
4 Not at all a serious problem   0.0370

I had a hard time desciribing this problem and to find a similiar question on this site. I hope I have described it well and in the case you know a similiar question, just link it here. :)

Greetings

Upvotes: 3

Views: 45

Answers (2)

Ahorn
Ahorn

Reputation: 3876

Like this?

library(tidyverse)
df %>% 
  pivot_longer(1:4) %>% 
  filter(!is.na(value)) %>% 
  count(name, value) %>% 
  group_by(name) %>% 
  mutate(prop = n / sum(n)) %>% 
  select(-n) %>% 
  pivot_wider(names_from = name, values_from = prop)

# A tibble: 4 x 5
  value                     air_quality climate_change trash water_polution
  <chr>                           <dbl>          <dbl> <dbl>          <dbl>
1 A somewhat serious probl         0.25           0.25  0.25           0.25
2 A very serious problem           0.5            0.25  0.5            0.25
3 Not a very serious probl         0.25           0.5   0.25           0.25
4 Not at all a serious prob       NA             NA    NA              0.25

Data

df <- tibble::tribble(
                   ~climate_change,              ~air_quality,              ~water_polution,                   ~trash,
        "Not a very serious probl", "A somewhat serious probl", "A somewhat serious probl", "A very serious problem",
        "Not a very serious probl", "Not a very serious probl", "Not a very serious probl",  "Not a very serious probl",
                                NA,                        NA,                           NA,                       NA,
                                NA,                        NA,                           NA,                       NA,
          "A very serious problem",  "A very serious problem",     "A very serious problem", "A very serious problem",
        "A somewhat serious probl",  "A very serious problem",  "Not at all a serious prob",  "A somewhat serious probl"
        )

Upvotes: 0

JohannesNE
JohannesNE

Reputation: 1363

You can use the pivot functions from {tidyr} to apply your solution to a long form version of the data frame, and then pivot it back to the original shape.

data <- tribble(~Q1, ~Q2, ~Q3,
                'ans1', 'ans1', 'ans1',
                'ans1', 'ans2', 'ans2',
                'ans2', 'ans2', 'ans2',
                'ans1', 'ans3', 'ans2',
                'ans3', 'ans1', NA,
                'ans3', 'ans3', 'ans1',
                 NA   , 'ans2', NA,)

data %>% 
  pivot_longer(everything()) %>% 
  group_by(name) %>% 
  count(value) %>% 
  drop_na() %>%                 # If you omit this line, NA values will be
                                # counted as a separate answer.
  mutate(prop = n / sum(n)) %>% 
  select(-n) %>% 
  pivot_wider(values_from = prop, values_fill = list(prop = 0)) 
  # If there is no proportion for a given Q/A combination, 
  # it is because the answer has not been given to this question.

# A tibble: 3 x 4
  value    Q1    Q2    Q3
  <chr> <dbl> <dbl> <dbl>
1 ans1  0.5   0.286   0.4
2 ans2  0.167 0.429   0.6
3 ans3  0.333 0.286   0  

Upvotes: 2

Related Questions