94621

Reputation: 89

ggplot summarise mean value of categorical variable on y axis

I am trying to replicate a Python plot in R that I found in this Kaggle notebook: Titanic Data Science Solutions

This is the Python code that generates the plot; the dataset used can be found here:

import seaborn as sns
...

grid = sns.FacetGrid(train_df, row='Embarked', height=2.2, aspect=1.6)  # 'size' was renamed to 'height' in seaborn 0.9
grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette='deep')
grid.add_legend()

Here is the resulting plot.

Seaborn Plot

The Survived column takes the values 0 and 1 (did not survive or survived), and the y-axis displays its mean per Pclass, i.e. the survival rate. When searching for a way to calculate the mean within ggplot2, I usually end up at the stat_summary() function. The best I could do was this:

library(dplyr)
library(ggplot2)
...

train_df %>%
  ggplot(aes(x = factor(Pclass), y = Survived, group = Sex, colour = Sex)) +
  stat_summary(fun = mean, geom = "line") +  # fun replaces the deprecated fun.y (ggplot2 >= 3.3.0)
  facet_grid(Embarked ~ .)

The output can be found here.

R Plot Output

There are some issues:

I think I also haven't fully grasped the layering concept of ggplot2. Instead of passing geom = "line" to stat_summary(), I would like to add the line as its own layer with + geom_line().

Upvotes: 0

Views: 1207

Answers (1)

Axeman

Reputation: 35397

There is actually an empty level (i.e. "") in train_df$Embarked. You can filter that out before plotting.

train_df <- read.csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
train_df <- subset(train_df, Embarked != "")

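# Mean plus bootstrapped confidence interval per group (mean_cl_boot needs the Hmisc package)
ggplot(train_df, aes(x = factor(Pclass), y = Survived, group = Sex, colour = Sex)) +
  stat_summary(fun.data = 'mean_cl_boot') +
  # fun replaces the deprecated fun.y argument (ggplot2 >= 3.3.0)
  geom_line(stat = 'summary', fun = mean) +
  facet_grid(Embarked ~ .)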

You can replicate the Python plot by drawing the means with confidence intervals using stat_summary(). Although your stat_summary() line was fine as it was, I've rewritten it as a geom_line() call, as you asked.
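To see what stat_summary(fun.data = 'mean_cl_boot') hands to the plot, you can call the helper directly on a vector of 0s and 1s. This is only a small sketch: the survived vector is made up for illustration and is not taken from the Titanic data, and it assumes the Hmisc package is installed.

```r
library(ggplot2)  # mean_cl_boot() wraps Hmisc::smean.cl.boot, so Hmisc must be installed

set.seed(1)  # the interval is bootstrapped, so fix the seed for repeatable output

# Hypothetical 0/1 outcomes for one group (not real Titanic rows)
survived <- c(1, 0, 1, 1, 0, 1, 0, 1)

# Returns a one-row data frame with y (the mean), ymin and ymax (the interval)
ci <- mean_cl_boot(survived)
print(ci)
```

The y column is the plain mean of the 0/1 values, which is the survival rate, and ymin/ymax are what the default pointrange geom draws as the vertical bar.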

Note that your ggplot code doesn't draw any points, so I can't answer that part; you were probably drawing the raw values, which are just many 0s and 1s.
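If you would rather keep the layers as plain geom_line() and geom_point() calls, another option is to compute the means yourself with dplyr first and plot the summarised table. A minimal sketch, assuming a made-up toy_df in place of train_df (the rows and the names toy_df, rates and survival_rate are illustrative only):

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for train_df (not the real Titanic rows)
toy_df <- data.frame(
  Embarked = rep(c("S", "C"), times = c(8, 4)),
  Pclass   = c(1, 1, 1, 1, 3, 3, 3, 3, 1, 1, 3, 3),
  Sex      = c("female", "female", "male", "male", "female", "female",
               "male", "male", "female", "male", "female", "male"),
  Survived = c(1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0)
)

# One row per Embarked/Pclass/Sex combination; the mean of 0/1 is the survival rate
rates <- toy_df %>%
  group_by(Embarked, Pclass, Sex) %>%
  summarise(survival_rate = mean(Survived), .groups = "drop")

# With pre-computed means, each geom is an ordinary layer using the default identity stat
p <- ggplot(rates, aes(x = factor(Pclass), y = survival_rate,
                       group = Sex, colour = Sex)) +
  geom_line() +
  geom_point() +
  facet_grid(Embarked ~ .)
print(p)
```

The trade-off is that you lose the bootstrapped intervals unless you compute them in the summarise() step as well.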


Upvotes: 3
