Reputation: 33
I have a dataframe like this:
library(ggplot2)
df <- data.frame(Reason = sample(rep(c("R1", "R2", "R3", "R4"), each = 100)),
                 Answer = sample(rep(c("yes", "no", "no", "no"), 100)))
head(df)
I want ggplot to do a bar plot that shows the share of "yes"-answers (y-axis) for every reason (x-axis).
I tried this:
ggplot(data = df, aes(x = interaction(Reason, Answer))) +
  geom_bar(aes(y = ..count../sum(..count..)))
This leads to the following outcome:
The problem is that the bars sum to 1 in total. I want them to sum to 1 within each Reason category (R1.no and R1.yes should sum to 1, R2.no and R2.yes should sum to 1, and so on).
Once this is done, I want to discard all bars that carry information about the "no" answers. So basically, I just want the share of "yes" answers within each Reason category. It should look something like this:
I obtained the desired result doing this:
a <- prop.table(table(df$Reason, df$Answer), 1)
df2 <- data.frame(Reason = rownames(as.matrix(a)),
                  share = as.matrix(a)[, 2])
ggplot(data = df2, aes(x = reorder(Reason, share), y = share)) +
  geom_bar(stat = "identity") +
  ylab("share of yes-answers")
Can I avoid this work-around and directly get the desired result from ggplot? This would have some major advantages for me.
Thanks a lot, Andi
Upvotes: 1
Views: 918
Reputation: 47008
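The solution by Yuriy only gives the right shares for data like the example, where every Reason has 100 observations and there are 100 "yes" answers in total. I think you have to calculate the proportion somehow, otherwise you cannot sort beforehand. So in the first part, I manipulate the data by adding a column p, which is 1 for "yes" and 0 for "no".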
library(dplyr)
library(ggplot2)
set.seed(99)
df <- data.frame(
  Reason = sample(rep(c("R1", "R2", "R3", "R4"), each = 100)),
  Answer = sample(rep(c("yes", "no", "no", "no"), 100)))
head(df %>% mutate(p = as.numeric(Answer == "yes")), 3)
  Reason Answer p
1     R3     no 0
2     R3    yes 1
3     R1     no 0
Then we plot this data frame. The y-axis is simply the mean of p within each group on the x-axis, so we can use stat_summary with fun.y=mean (the argument was renamed to fun in ggplot2 3.3.0). reorder works very well in this case because it calculates the average of each category and reorders according to that:
ggplot(df %>% mutate(p = as.numeric(Answer == "yes")),
       aes(x = reorder(Reason, p), y = p)) +
  stat_summary(fun.y = "mean", geom = "bar", fill = "orchid4")
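As a quick sanity check on the idea above, here is a minimal base R sketch (no ggplot2 needed) showing that the group mean of the 0/1 column p equals the share of "yes" answers, and that reorder, whose default summary function is mean, orders the levels by that share. The names share_mean, share_table and ordered_levels are only introduced here for illustration.

```r
# Same example data as above; the seed is only there for reproducibility.
set.seed(99)
df <- data.frame(
  Reason = sample(rep(c("R1", "R2", "R3", "R4"), each = 100)),
  Answer = sample(rep(c("yes", "no", "no", "no"), 100)))
df$p <- as.numeric(df$Answer == "yes")

# Share of "yes" per Reason, computed two ways:
# (1) mean of the 0/1 indicator, (2) row-wise proportions of the contingency table.
share_mean  <- tapply(df$p, df$Reason, mean)
share_table <- prop.table(table(df$Reason, df$Answer), 1)[, "yes"]
stopifnot(isTRUE(all.equal(as.numeric(share_mean), as.numeric(share_table))))

# reorder() summarises p per Reason with mean() by default,
# so its levels come out sorted by the share of "yes".
ordered_levels <- levels(reorder(df$Reason, df$p))
print(share_mean)
print(ordered_levels)
```

This is why stat_summary with the mean and reorder(Reason, p) agree with each other: both work from the same per-group average of p.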
And this will work for situations where you have different numbers of observations in different categories:
set.seed(100)
df <- data.frame(
  Reason = rep(c("R1", "R2", "R3", "R4"), times = seq(50, 200, length.out = 4)),
  Answer = sample(c("yes", "no"), 500, prob = c(0.5, 0.5), replace = TRUE)
)
# we expect
sort(tapply(df$Answer == "yes", df$Reason, mean))
   R2    R4    R3    R1
0.460 0.505 0.520 0.540
ggplot(df %>% mutate(p = as.numeric(Answer == "yes")),
       aes(x = reorder(Reason, p), y = p)) +
  stat_summary(fun.y = "mean", geom = "bar", fill = "orange")
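To illustrate why unequal group sizes matter, here is a rough base R sketch comparing two quantities on the same data: the share of "yes" within each Reason, and each Reason's share of all "yes" answers. I am assuming the second one is what a proportion computed over the yes-only subset with a single group amounts to. The names within_reason and share_of_all_yes are made up for this example.

```r
set.seed(100)
df <- data.frame(
  Reason = rep(c("R1", "R2", "R3", "R4"), times = seq(50, 200, length.out = 4)),
  Answer = sample(c("yes", "no"), 500, prob = c(0.5, 0.5), replace = TRUE)
)

# Treat Reason as a factor so every level is kept, even one with no "yes".
reason <- factor(df$Reason)
is_yes <- df$Answer == "yes"

# Share of "yes" inside each Reason (what the question asks for).
within_reason <- tapply(is_yes, reason, mean)

# Each Reason's share of all "yes" answers (depends on the group sizes).
yes_counts <- table(reason[is_yes])
share_of_all_yes <- as.numeric(yes_counts) / sum(yes_counts)

print(round(cbind(within_reason, share_of_all_yes), 3))
```

The second column always sums to 1 across the reasons, so larger groups get taller bars regardless of how often "yes" occurs within them.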
Upvotes: 1
Reputation: 8880
ggplot(df[df$Answer == "yes", ]) +
geom_bar(aes(x = Reason, y = sort(..prop..), group = 1))
Upvotes: 0