Mourad BENKDOUR

Reputation: 1033

Seaborn Plot Distribution with histogram with stat = density or probability?

I know that by default a histogram counts the number of occurrences in each bin. Alternatively, we can visualize the distribution with density or probability.

sns.displot(data, stat = 'density')

or

sns.displot(data, stat = 'probability')

My question is: in which cases should I use stat = 'density', and in which stat = 'probability'?

Upvotes: 1

Views: 8586

Answers (1)

MagnusO_O

Reputation: 1283

stat = 'density' normalizes the histogram so that it approximates a probability density function (PDF) (Wikipedia).
As JohanC mentioned in the comments, a key aspect of a PDF is that the area under the curve (or of all bars together) is 1. So each bar's width is taken into account along with its height.

stat = 'probability' creates the same bars (including the same widths), but each height (y-axis value) directly states the probability of that bin, and the sum of all the bar heights is 1.
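To make the two normalizations concrete, here is a minimal pure-Python sketch of how both sets of heights can be derived from the same bin counts. It is only an illustration under assumptions: the helper name normalized_histogram and the sample values are made up, the bins are equal-width, and this is not seaborn's internal code.

```python
def normalized_histogram(values, bins):
    """Return (counts, probabilities, densities, bin_width) for equal-width bins.

    Hypothetical helper for illustration; assumes max(values) > min(values).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # the maximum value is placed in the last bin (right edge inclusive)
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    n = len(values)
    # 'probability': share of observations per bin, heights sum to 1
    probabilities = [c / n for c in counts]
    # 'density': share divided by bin width, so height * width sums to 1
    densities = [c / (n * width) for c in counts]
    return counts, probabilities, densities, width


sample = [1.0, 2.0, 2.5, 3.0, 4.0, 7.0, 8.0, 9.0]  # made-up data
counts, probs, dens, width = normalized_histogram(sample, bins=2)

print("counts:       ", counts)
print("probabilities:", probs, "sum of heights =", sum(probs))
print("densities:    ", dens, "total area =", sum(d * width for d in dens))
```

The only difference between the two is the division by the bin width, which is why 'density' heights change when the bin width changes while the total area stays at 1.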


Which one to use depends on what you want to show with your plot and who the audience is.

'probability' is more intuitive and is also understandable for stacked bars.
'density' is better suited for an expert audience that is familiar with PDFs.

Also, since PDFs are usually displayed as a continuous curve, 'density' with displot and bins is better suited for a larger number of bins, while 'probability' with displot also works intuitively for e.g. 2 bins.


The Seaborn tutorial "Visualizing distributions of data", section "Normalized histogram statistics", provides explanations and example plots.
To visualize the statements from this answer, the following uses reduced example data and plots along with a different angle of explanation.


Data preparation (the df conversion is kept basic, with commented-out prints for an easy cross check):

import pandas as pd
import seaborn as sns


penguins = sns.load_dataset("penguins")
penguins_strip = penguins[['flipper_length_mm', 'sex']].dropna()
# print(penguins_strip)
print('Female and Male')
print(f'range: {penguins_strip["flipper_length_mm"].max() - penguins_strip["flipper_length_mm"].min()}')
print(f'len: {len(penguins_strip)}')

penguins_strip_male = penguins_strip[penguins_strip['sex'] == 'Male']
# print(penguins_strip_male)
print('Male only')
print(f'range: {penguins_strip_male["flipper_length_mm"].max() - penguins_strip_male["flipper_length_mm"].min()}')
print(f'len: {len(penguins_strip_male)}')

Output:

Female and Male
range: 59.0
len: 333

Male only
range: 53.0
len: 168

A function that displays the values on top of the displot bars, heavily based on an answer from Trenton McKinney:

def show_values(plot):
    for ax in plot.axes.ravel():
        # add annotations
        for c in ax.containers:
            # label each bar with its height (5 decimals); use an empty string so 0-value bars don't get a number
            labels = [f'{w:0.5f}' if (w := v.get_height()) > 0 else '' for v in c]
            ax.bar_label(c, labels=labels, label_type='edge', fontsize=8, rotation=0, padding=2)
        ax.margins(y=0.2)

Note: Due to the limited number of displayed float digits, some of the following calculations are affected by rounding.


2 bins, 'Male' flippers only

Default displot (without stat):


'probability' plot - note the intuitive y-axis probabilities of the bins, which add up to 1.


'density' plot - see area calculations below


0.02156 * (53/2) = 0.57134
0.01617 * (53/2) = 0.428505
# see data preparation above, range is 53, and it's 2 bins

Adding these two areas up gives 1 (rounding aside).
You can try bins_nr = 1 and easily check the area for that case, whereas for 'probability' with bins_nr = 1, y will just be 1.
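The area check above can also be scripted. This is just a small sketch that reuses the rounded bar labels and the male-only range from the data preparation output; the variable names are mine, and the single-bin height is computed from the definition of density rather than read from a plot.

```python
flipper_range = 53.0                  # male-only range from the data preparation output
bins_nr = 2
bin_width = flipper_range / bins_nr   # all bins have the same width

density_heights = [0.02156, 0.01617]  # rounded values taken from the bar labels

# area of each bar = height * width
areas = [h * bin_width for h in density_heights]
total_area = sum(areas)
print("areas:", areas)
print("total area:", total_area)      # close to 1, rounding aside

# with a single bin covering the whole range, the density height is 1 / range,
# so the area is again 1, while the 'probability' height would simply be 1
single_bin_height = 1 / flipper_range
print("single-bin density height:", single_bin_height)
print("single-bin area:", single_bin_height * flipper_range)
```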

Code of the plots

bins_nr = 2

displot_default = sns.displot(penguins_strip_male, x="flipper_length_mm", hue="sex", 
                              bins=bins_nr, multiple="dodge")
show_values(displot_default)
    
displot_density = sns.displot(penguins_strip_male, x="flipper_length_mm", hue="sex", 
                              bins=bins_nr, multiple="dodge", stat = 'density')
show_values(displot_density)
        
displot_probability = sns.displot(penguins_strip_male, x="flipper_length_mm", hue="sex", 
                                  bins=bins_nr, multiple="dodge", stat = 'probability')
show_values(displot_probability)

Stacked plot example (only feasible for 'probability')


displot_probability_stacked = sns.displot(penguins_strip, x="flipper_length_mm", hue="sex", 
                                  bins=bins_nr, multiple="stack", stat = 'probability')
show_values(displot_probability_stacked)

Addon: In case you wonder about the common_norm example from the tutorial, check

displot_density = sns.displot(penguins_strip, x="flipper_length_mm", hue="sex", 
                              bins=bins_nr, multiple="dodge", stat = 'density')
show_values(displot_density)

displot_density_common = sns.displot(penguins_strip, x="flipper_length_mm", hue="sex", bins=bins_nr, 
                multiple="dodge", stat = 'density', common_norm=False)
show_values(displot_density_common)

and calculate the areas.
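As a rough guide to what to expect from that calculation, here is a pure-Python sketch of the two normalizations. The counts and the bin width are hypothetical (not the penguins data), the function name is mine, and it assumes shared bin edges for both groups. It is my reading of how common_norm behaves, not seaborn's actual implementation.

```python
def group_densities(counts_by_group, bin_width, common_norm=True):
    """Density heights per hue group (hypothetical helper for illustration).

    common_norm=True:  normalize by the total number of observations,
                       so the areas of all groups together sum to 1.
    common_norm=False: normalize each group by its own size,
                       so every group has an area of 1 on its own.
    """
    total = sum(sum(counts) for counts in counts_by_group.values())
    result = {}
    for group, counts in counts_by_group.items():
        norm = total if common_norm else sum(counts)
        result[group] = [c / (norm * bin_width) for c in counts]
    return result


counts = {"Female": [6, 2], "Male": [3, 9]}  # made-up counts for 2 shared bins
bin_width = 10.0                             # made-up bin width

for flag in (True, False):
    heights = group_densities(counts, bin_width, common_norm=flag)
    areas = {g: sum(h * bin_width for h in hs) for g, hs in heights.items()}
    print(f"common_norm={flag}: heights={heights}, areas={areas}")
```

With common_norm=True the group areas should add up to 1 together, and with common_norm=False each group's area should be 1 by itself.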

Upvotes: 2
