Andi Lope

Reputation: 33

Plot the Share of one Category of a Categorical Variable with Respect to all Categories of a Second Variable

I have a dataframe like this:

df <- data.frame(Reason = sample(rep(c("R1", "R2", "R3", "R4"), each = 100)),
                 Answer = sample(rep(c("yes", "no", "no", "no"), 100)))

head(df)

I want ggplot to do a bar plot that shows the share of "yes"-answers (y-axis) for every reason (x-axis).

I tried this:

ggplot(data = df, aes(x = interaction(Reason, Answer))) + 
 geom_bar(aes(y = ..count../sum(..count..)))

This leads to the following outcome:

how it looks like

The problem is that the bars sum up to 1 (in total). I want them to sum up to one within each Reason-category. (R1.no and R1.yes should sum up to 1, R2.no and R2.yes should sum up to one and so on).

Once this is done, I want to discard all bars that carry information about the "no"-answers. So basically, I just want the shares of the "yes"-answers within each Reason-category. It should look something like this:

how it should look like

I obtained the desired result doing this:

a <- prop.table(table(df$Reason, df$Answer),1)

df2 <- data.frame(Reason = rownames(as.matrix(a)),
                  share = as.matrix(a)[,2])

ggplot(data = df2, aes(x = reorder(Reason, share), y = share)) + 
  geom_bar(stat = "identity") + 
  ylab("share of yes-answers")

Can I avoid this work-around and directly get the desired result from ggplot? This would have some major advantages for me.

Thanks a lot, Andi

Upvotes: 1

Views: 918

Answers (2)

StupidWolf

Reputation: 47008

The solution by Yuriy seems to work only when every Reason category has the same number of observations (100 each in your example). I think you have to calculate the proportion somehow, otherwise you cannot sort beforehand. So in the first part, I manipulate the data by adding a column p, which is 1 if the answer is "yes" and 0 if it is "no".

library(dplyr)
library(ggplot2)
set.seed(99)
df <- data.frame(
Reason = sample(rep(c("R1", "R2", "R3", "R4"), each = 100)),
Answer = sample(rep(c("yes", "no", "no", "no"), 100)))

head(df %>% mutate(p=as.numeric(Answer=="yes")),3)
  Reason Answer p
1     R3     no 0
2     R3    yes 1
3     R1     no 0

Then we plot with this data frame. The y-axis is simply the mean of p within each group on the x-axis, so we can use stat_summary with fun="mean" (this argument was called fun.y before ggplot2 3.3.0, where it was deprecated). reorder works well here because, by default, it computes the mean of p for each category and orders the categories by it:

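# p is 1 for "yes" and 0 for "no", so its group mean is the share of "yes"
ggplot(df %>% mutate(p = as.numeric(Answer == "yes")),
       aes(x = reorder(Reason, p), y = p)) +
  # fun replaces the deprecated fun.y (ggplot2 >= 3.3.0)
  stat_summary(fun = "mean", geom = "bar", fill = "orchid4")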

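If you want to confirm that the bars really are the per-Reason shares, one way is to pull the computed values back out of the plot. This is only a sketch: it assumes ggplot2 3.3.0 or later (for the fun argument), and the names plt, bar_heights and expected are mine, not part of the question.

```r
library(ggplot2)

set.seed(99)
df <- data.frame(
  Reason = sample(rep(c("R1", "R2", "R3", "R4"), each = 100)),
  Answer = sample(rep(c("yes", "no", "no", "no"), 100))
)
# 1 for "yes", 0 for "no"
df$p <- as.numeric(df$Answer == "yes")

plt <- ggplot(df, aes(x = reorder(Reason, p), y = p)) +
  stat_summary(fun = "mean", geom = "bar")

# Heights that ggplot computed for the bars, left to right
bar_heights <- layer_data(plt)$y

# The same shares computed by hand, sorted ascending
expected <- as.numeric(sort(tapply(df$p, df$Reason, mean)))

all.equal(bar_heights, expected)
```

layer_data() returns the data a layer actually draws, so comparing it with tapply() checks both the bar heights and the ordering that reorder produced.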

This also works when the categories have different numbers of observations:

set.seed(100)
df <- data.frame(
Reason = rep(c("R1", "R2", "R3", "R4"),times=seq(50,200,length.out=4)),
Answer = sample(c("yes","no"),500,prob=c(0.5,0.5),replace=TRUE)
)
# we expect
sort(tapply(df$Answer=="yes",df$Reason,mean))
R2    R4    R3    R1 
0.460 0.505 0.520 0.540 

ggplot(df %>% mutate(p = as.numeric(Answer == "yes")),
       aes(x = reorder(Reason, p), y = p)) +
  # fun replaces the deprecated fun.y (ggplot2 >= 3.3.0)
  stat_summary(fun = "mean", geom = "bar", fill = "orange")

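Since the question asks for a way to get this directly from ggplot, here is a possible variant that skips the mutate step and puts the yes/no indicator straight into aes(). Treat it as a sketch under the same assumption (ggplot2 3.3.0 or later); plt_inline and inline_heights are illustrative names only.

```r
library(ggplot2)

set.seed(100)
df <- data.frame(
  Reason = rep(c("R1", "R2", "R3", "R4"), times = seq(50, 200, length.out = 4)),
  Answer = sample(c("yes", "no"), 500, prob = c(0.5, 0.5), replace = TRUE)
)

# The indicator is built inside aes(), so no helper column is needed.
# reorder() averages the logical vector per Reason (TRUE counts as 1).
plt_inline <- ggplot(df, aes(x = reorder(Reason, Answer == "yes"),
                             y = as.numeric(Answer == "yes"))) +
  stat_summary(fun = "mean", geom = "bar") +
  labs(x = "Reason", y = "share of yes-answers")

# Bar heights as computed by ggplot, left to right
inline_heights <- layer_data(plt_inline)$y
```

The trade-off is that the expression Answer == "yes" appears twice, so the mutate version may be easier to read if the condition gets more complicated.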

Upvotes: 1

Yuriy Saraykin

Reputation: 8880

ggplot(df[df$Answer == "yes", ]) + 
  geom_bar(aes(x = Reason, y = sort(..prop..), group = 1))
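
A note on what this computes, as a sketch in base R: after filtering to the "yes" rows, a proportion taken over a single group is each Reason's share among all "yes" answers, which is not the same as the share of "yes" within each Reason unless the Reason groups have equal sizes. It also appears that sort() here reorders the bar heights without reordering the x-axis labels. The example data and the names share_among_yes and share_within_reason are assumptions for illustration.

```r
set.seed(100)
df <- data.frame(
  Reason = rep(c("R1", "R2", "R3", "R4"), times = seq(50, 200, length.out = 4)),
  Answer = sample(c("yes", "no"), 500, prob = c(0.5, 0.5), replace = TRUE)
)

# Share of each Reason among the "yes" rows: sums to 1 across Reasons
yes_rows <- df[df$Answer == "yes", ]
share_among_yes <- prop.table(table(yes_rows$Reason))

# Share of "yes" within each Reason: each value lies between 0 and 1
share_within_reason <- tapply(df$Answer == "yes", df$Reason, mean)

print(share_among_yes)
print(share_within_reason)
```

With unequal group sizes the two results differ, and the second one is the quantity the question asks for.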

Upvotes: 0
