How to calculate percentages in a stacked barplot bar-wise?

Question

Problem

The current percentages in the bars are calculated from the total amount of data. I want each stack to add up to a full 100%. (Solved)

Also the percentages should be rounded to the nearest integer. (Solved)

Edit: Remove all percentages below or equal to 1. (Solved)

Edit2: Make sure no labels are overlapping.

I've been googling for a while now. It seems like there isn't a proper way to prevent labels overlapping.

Possible solutions I discovered:

Flip the plot
Add angle() to rotate the labels
"Manually" calculate each position
Make use of check_overlap = TRUE

Current State

My Code so far

# Load libraries & packages =================================
library("ggplot2")
library("scales")
library("dplyr")
library("foreign")
library("tidyverse")
library("forcats")


# Data setup =================================
# Forward slashes: single backslashes in an R string are read as escape sequences
spss_file_path <- "D:/Programming/Testing/2017-03-15_data_import&ggplot2/Beispieldatensatz(fiktiv).sav"
exampledata <- read.spss(spss_file_path, use.value.labels = TRUE,
                         to.data.frame = TRUE, reencode = TRUE)


exampledata$V43   <- factor(exampledata$V43,
                            levels = c(1,2,3,4,5),
                            labels = c("1 Sehr zufrieden","2","3","4", "5 Sehr unzufrieden"))

exampledata$V43   <- factor(exampledata$V43, levels = rev(unique(levels(exampledata$V43))))
exampledata$A_REF <- factor(exampledata$A_REF, levels = rev(unique(levels(exampledata$A_REF))))
exampledata$V101  <- factor(exampledata$V101, levels = rev(unique(levels(exampledata$V101))))

labels <- exampledata %>% 
  filter(!is.na(V101), !is.na(V43)) %>% 
  count(A_REF) %>% 
  mutate(labels = paste(A_REF,"(n=", n, ")")) %>% 
  select(A_REF, labels)

plot_data <-  exampledata %>% 
  filter(!is.na(V101), !is.na(V43)) %>% 
  left_join(labels, by = "A_REF")

plot_data <- plot_data %>% 
  group_by(labels) %>% 
  summarize(`5 Sehr unzufrieden` = sum(ifelse(V43 == "5 Sehr unzufrieden", 1, 0)) / n(),
            `4` = sum(ifelse(V43 == "4", 1, 0)) / n(),
            `3` = sum(ifelse(V43 == "3", 1, 0)) / n(),
            `2` = sum(ifelse(V43 == "2", 1, 0)) / n(),
            `1 Sehr zufrieden` = sum(ifelse(V43 == "1 Sehr zufrieden", 1, 0)) / n()) %>%
  gather(key = Rating, value = prop, -labels)

plot_data$labels <- factor(plot_data$labels)
plot_data$Rating <- factor(plot_data$Rating) %>% fct_rev()

# Plot =================================
ggplot(plot_data, aes(x = labels, y = prop, fill = Rating)) +
  geom_col() + 
  scale_y_continuous(labels = scales::percent, breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1)) +
  labs(y=NULL, x=NULL, fill=NULL) + 
  ggtitle(paste(attr(exampledata, "variable.labels")[77])) + 
  theme_classic() + 
  # if_else() needs a typed NA (not NULL) for the labels that should be hidden
  geom_text(aes(label = if_else(prop > 0.02, scales::percent(round(prop, 2)), NA_character_)),
            position = position_fill(vjust = 0.5)) +
  coord_flip()

Data

structure(list(exampledata.V101 = structure(c(2L, NA, 2L, 2L, 
2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, NA, 2L, 2L, 2L, 1L, 2L, NA, 
NA, NA, 1L, 1L, 2L, NA, 2L, 2L, 2L, NA, 2L, 2L, NA, NA, 1L, NA, 
2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, NA, NA, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, NA, 1L, NA, 1L, NA, 
1L, 2L, NA, NA, 2L, NA, 1L, 2L, 2L, NA, 2L, NA, 2L, 2L, 1L, 2L, 
1L, 2L, 1L, 1L, 2L, 1L, NA, 2L, 2L, 2L, 2L, NA, 2L, 1L, 2L, 2L
), .Label = c("Weiblich", "Männlich"), class = "factor"), exampledata.A_REF = structure(c(18L, 
18L, 18L, 18L, 18L, 17L, 18L, 18L, 18L, 18L, 18L, 18L, 16L, 18L, 
18L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 16L, 18L, 18L, 16L, 18L, 
16L, 18L, 18L, 17L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 
16L, 18L, 18L, 17L, 18L, 18L, 18L, 18L, 18L, 18L, 17L, 16L, 18L, 
18L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 17L, 18L, 18L, 
16L, 18L, 16L, 18L, 18L, 16L, 16L, 18L, 18L, 18L, 18L, 18L, 18L, 
18L, 17L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 16L, 18L, 
16L, 16L, 18L, 18L, 18L, 17L, 16L, 18L), .Label = c("Zertifikat eines Aufbau- oder Ergänzungsstudiums", 
"LA Berufliche Schulen", "LA Sonderschule", "LA Gymnasium", "LA Haupt- und Realschule", 
"LA Grundschule", "Künstlerischer/musischer Abschluss", "Kirchlicher Abschluss", 
"Staatsexamen (ohne Lehramt)", "Diplom Fachhochschule, Diplom I an Gesamthochschulen", 
"Diplom Universität, Diplom II an Gesamthochschulen", "Sonstiges", 
"Promotion", "Staatsexamen", "Magister", "Diplom", "Master", 
"Bachelor"), class = "factor"), exampledata.V43 = structure(c(3L, 
5L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 4L, 3L, 3L, 2L, NA, 4L, 5L, 5L, 
4L, 4L, 4L, 4L, NA, 2L, 4L, 3L, 5L, 4L, 4L, 4L, NA, 4L, 4L, NA, 
NA, 3L, 5L, 2L, 4L, 5L, 4L, 4L, 5L, 5L, 4L, NA, NA, 4L, NA, 3L, 
4L, 5L, 5L, 2L, 4L, 4L, 3L, 4L, 4L, 4L, 3L, 5L, 4L, 5L, NA, 4L, 
NA, 4L, NA, 4L, 5L, 4L, NA, 5L, NA, 4L, 4L, 4L, NA, 4L, NA, 5L, 
4L, 4L, 4L, 4L, 4L, 3L, 3L, 4L, 2L, 4L, 4L, 4L, 3L, 4L, NA, 4L, 
5L, 5L, 4L), .Label = c("5 Sehr unzufrieden", "4", "3", "2", 
"1 Sehr zufrieden"), class = "factor")), .Names = c("exampledata.V101", 
"exampledata.A_REF", "exampledata.V43"), row.names = c(NA, 100L
), class = "data.frame")

Phil · Accepted Answer

It's usually preferable to summarize your data before charting it. I find that having ggplot2 do the summarization for you is either limited or makes it hard to display the result the way you want.

library(tidyverse)
library(forcats)

Because it's best to summarize your data before plotting it in ggplot2, the following bit of code calculates, within each group in labels, the proportion of respondents who selected each answer on the scale. In the final step I turn the data from wide to long, so that all the proportions to be charted are in the same variable (which I call prop).

plot_data <- plot_data %>% group_by(labels) %>% 
            summarize(`5 Sehr unzufrieden` = sum(ifelse(V43 == "5 Sehr unzufrieden", 1, 0)) / n(),
                      `4` = sum(ifelse(V43 == "4", 1, 0)) / n(),
                      `3` = sum(ifelse(V43 == "3", 1, 0)) / n(),
                      `2` = sum(ifelse(V43 == "2", 1, 0)) / n(),
                      `1 Sehr zufrieden` = sum(ifelse(V43 == "1 Sehr zufrieden", 1, 0)) / n()) %>%
            gather(key = Rating, value = prop, -labels)
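
As a rough cross-check of the idea, the same per-group proportions can also be computed with count() followed by a grouped mutate(). This is only a sketch on made-up data: toy, group and rating are hypothetical names, not columns from the question's data set. It produces the long format directly, so no gather() step is needed.

```r
library(dplyr)

# Hypothetical toy data: two groups with a few ratings each
toy <- data.frame(
  group  = c("A", "A", "A", "A", "B", "B"),
  rating = c("1", "1", "2", "3", "1", "2")
)

props <- toy %>%
  count(group, rating) %>%        # one row per group/rating combination
  group_by(group) %>%
  mutate(prop = n / sum(n)) %>%   # share within the group, not of the whole data
  ungroup()

# Each group's proportions add up to 1, so every bar reaches 100%
props %>%
  group_by(group) %>%
  summarize(total = sum(prop))
```

One difference to keep in mind: count() drops combinations that never occur, whereas the summarize() version above returns a proportion of 0 for them.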

It's preferable to store categorical variables as factors, because that lets you control things like the order and the colours, so that is what the following does. Initially, my code put the scale labels (which I called Rating in the gather function above) in the reverse of the order you had, so I'm using fct_rev from the forcats package to reverse them back.

plot_data$labels <- factor(plot_data$labels)
plot_data$Rating <- factor(plot_data$Rating) %>% fct_rev()
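
To see what fct_rev does in isolation, here is a small sketch on a made-up vector (rating is a hypothetical name used only for illustration). It assumes the default behaviour of factor(), which sorts the levels, and shows that only the level order changes, not the values themselves.

```r
library(forcats)

# Hypothetical ratings vector; factor() sorts the levels by default
rating <- factor(c("1 Sehr zufrieden", "3", "5 Sehr unzufrieden", "2", "4", "3"))
levels(rating)

# fct_rev() reverses the level order without touching the values
reversed <- fct_rev(rating)
levels(reversed)

# The underlying values are unchanged
identical(as.character(reversed), as.character(rating))
```

Since ggplot2 stacks and orders the legend by factor level, reversing the levels is what flips the order of the segments in the chart.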

For the chart below, I just made a couple of changes. The most notable is that I'm using geom_col instead of geom_bar. In the background, geom_col is the same as geom_bar(stat = "identity") - it's just quicker to type. We're essentially telling ggplot2 to chart the data as is instead of treating it like raw data. However, I do need to specify the y aesthetic to indicate what data I want charted, so I'm specifying to use the prop variable in the initial ggplot call.

# Plot =================================
ggplot(plot_data, aes(x = labels, y = prop, fill = Rating)) +
  geom_col() + 
  scale_y_continuous(labels = scales::percent, breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1)) +
  labs(y=NULL, x=NULL, fill=NULL) + 
  ggtitle(paste(attr(exampledata, "variable.labels")[77])) + 
  theme_classic() + 
  # if_else() needs a typed NA (not NULL) for the labels that should be hidden
  geom_text(aes(label = if_else(prop > 0.01, scales::percent(round(prop, 2)), NA_character_)),
            position = position_fill(vjust=0.5)) +
  coord_flip()

The only other line I changed is the geom_text call above. I added an if_else function so that it shows the label only if the proportion is above 1% and hides it otherwise. I also used the round function so that the percentages have no decimals. Note that prop is a proportion rather than a percentage, so you need to round to 2 decimal places to get whole percentages (0.256 becomes 0.26, which is displayed as 26%).
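
The label logic can be tried out on its own. The following is a minimal sketch in base R, using a made-up prop vector and ifelse/paste0 as stand-ins for if_else and scales::percent, so the exact formatting of the real chart may differ.

```r
# Hypothetical proportions (0.4%, 1.49%, 25.6% and 72.51%)
prop <- c(0.004, 0.0149, 0.256, 0.7251)

# Rounding the proportion to 2 decimals is the same as
# rounding the percentage to a whole number
round(prop, 2)
round(100 * prop)

# Show a label only when the proportion is above 1%; otherwise use NA,
# which geom_text simply does not draw
label <- ifelse(prop > 0.01, paste0(round(100 * prop), "%"), NA_character_)
label
```

Because the comparison uses the unrounded value, a proportion such as 0.0149 still gets the label "1%". If labels of exactly 1% should be hidden as well, compare the rounded value instead, i.e. round(prop, 2) > 0.01.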
