C'estLindION

Reputation: 51

Seaborn/Matplotlib categorical plot markers size by count of observations

I want to scale the markers on a plot of two categorical variables by the count of observations.

I am using seaborn.pairplot for convenience because I have quite a lot of variables (features), but I don't think it has an argument for a case like this.

Upvotes: 2

Views: 2043

Answers (1)

Patrick FitzGerald

Reputation: 3630

I am guessing that what you are looking for is a balloon plot, also known as a matrix bubble chart or a categorical bubble plot. To my knowledge, seaborn does not provide this type of plot as of version 0.11.0, so using pairplot is currently not an option. I know of two functions that draw this type of plot, each displaying a single categorical-to-categorical relationship with a selected numerical variable setting the size of the markers: one in the pygal package, and catscatter. The downside is that both require the count of observations to already be a column in your dataset, which I assume is not your case.

Here is a way to create a balloon plot displaying the count of observations grouped by two categorical variables contained in a pandas dataframe:

import pandas as pd                # v 1.1.3
import matplotlib.pyplot as plt    # v 3.3.2
import seaborn as sns              # v 0.11.0

# Import seaborn sample dataset stored as a pandas dataframe and select
# the categorical variables to plot
df = sns.load_dataset('titanic')
x = 'who'  # contains 3 unique values: 'child', 'man', 'woman'
y = 'embark_town'  # contains 3 unique values: 'Southampton', 'Queenstown', 'Cherbourg'

# Compute the counts of observations
df_counts = df.groupby([x, y]).size().reset_index(name='count')

# Compute a size variable for the markers so that they have a good size regardless
# of the total count and the number of unique values in each categorical variable
scale = 500*df_counts['count'].size
size = df_counts['count']/df_counts['count'].sum()*scale

# Create matplotlib scatter plot with additional formatting
fig, ax = plt.subplots(figsize=(8,6))
ax.scatter(x, y, s=size, data=df_counts, zorder=2)
ax.grid(color='grey', linestyle='--', alpha=0.4, zorder=1)
ax.tick_params(length=0)
ax.set_frame_on(False)
ax.margins(.3)

balloon plot
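Since the question mentions having many variables, here is a rough sketch of how the same idea could be repeated for every pair of categorical columns, a bit like a pairplot. The helper names balloon_counts and balloon_grid and the base_size parameter are made up for this illustration and are not part of seaborn or matplotlib; the marker scaling is the same as in the code above (mean marker area equal to base_size).

```python
from itertools import combinations

import matplotlib.pyplot as plt
import pandas as pd


def balloon_counts(df, x, y):
    """Count observations for each combination of the categories in x and y."""
    return df.groupby([x, y], observed=True).size().reset_index(name='count')


def balloon_grid(df, columns, base_size=500):
    """Draw one balloon plot per pair of columns (hypothetical helper)."""
    pairs = list(combinations(columns, 2))
    fig, axes = plt.subplots(1, len(pairs), figsize=(5 * len(pairs), 4),
                             squeeze=False)
    for ax, (x, y) in zip(axes[0], pairs):
        counts = balloon_counts(df, x, y)
        # Same scaling as above: share of total count times a scale that
        # grows with the number of category combinations
        size = counts['count'] / counts['count'].sum() * base_size * len(counts)
        # Cast to str so that boolean or numeric categories are plotted
        # on a categorical axis
        ax.scatter(counts[x].astype(str), counts[y].astype(str), s=size, zorder=2)
        ax.grid(color='grey', linestyle='--', alpha=0.4, zorder=1)
        ax.tick_params(length=0)
        ax.set_frame_on(False)
        ax.margins(.3)
        ax.set_xlabel(x)
        ax.set_ylabel(y)
    fig.tight_layout()
    return fig, axes
```

With the titanic dataframe from above, this could be called as balloon_grid(df, ['who', 'embark_town', 'class']) followed by plt.show(). All panels are placed in a single row, so for more than a handful of columns you would probably want to wrap the panels into several rows.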

Sources of inspiration: catscatter, this answer

Upvotes: 1
