UserR6

Reputation: 513

Plotting a gaussian fit to a histogram in displot or histplot

How can I plot a gaussian fit onto a histplot, as was previously possible with the deprecated distplot?

import seaborn as sns
import numpy as np
from scipy.stats import norm
x = np.random.normal(size=500) * 0.1

With distplot I could do:

sns.distplot(x, kde=False, fit=norm)


But how to go about it in displot or histplot?

The closest I've come so far is:

sns.histplot(x,stat="probability", bins=30, kde=True, kde_kws={"bw_adjust":3})

But I think this just increases the smoothing of the plotted kde, which isn't exactly what I'm going for.

Upvotes: 17

Views: 11274

Answers (3)

cottontail

Reputation: 23021

The part of distplot's source code that handles the fit= parameter is very similar to what the other answers here suggest: initialize a support array, compute PDF values on it using parameters fitted to the data (the mean and standard deviation for a normal distribution), and superimpose a line plot on the histogram. We can transcribe the relevant part of that code into a custom function and use it to plot a gaussian fit (it doesn't have to be normal; it could be any continuous distribution).

An example implementation is as follows.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

def add_fit_to_histplot(a, fit=stats.norm, ax=None):

    if ax is None:
        ax = plt.gca()

    # compute bandwidth
    bw = len(a)**(-1/5) * a.std(ddof=1)
    # initialize PDF support
    x = np.linspace(a.min()-bw*3, a.max()+bw*3, 200)
    # compute PDF parameters
    params = fit.fit(a)
    # compute PDF values
    y = fit.pdf(x, *params)
    # plot the fitted continuous distribution
    ax.plot(x, y, color='#282828')
    return ax

# sample data
x = np.random.default_rng(0).normal(1, 4, size=500) * 0.1

# plot histogram with gaussian fit
sns.histplot(x, stat='density')
add_fit_to_histplot(x, fit=stats.norm);

[Figure: first iteration]

If you don't like the black bar edges or the default colors, you can set the bar color, edge color and alpha so that the histplot() output matches the default style of the deprecated distplot().

import numpy as np

# sample data
x = np.random.default_rng(0).normal(1, 4, size=500) * 0.1

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10,4))

# left subplot
sns.distplot(x, kde=False, fit=stats.norm, ax=ax1)
ax1.set_title('Using distplot')

# right subplot
sns.histplot(x, stat='density', color='#1f77b4', alpha=0.4, edgecolor='none', ax=ax2)
add_fit_to_histplot(x, fit=stats.norm, ax=ax2)
ax2.set_title('Using histplot+fit');

[Figure: result]
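
The fit argument is not restricted to stats.norm. As a minimal sketch, assuming gamma-distributed sample data, the same steps can be run with stats.gamma. Note that fitted_pdf_curve is a hypothetical helper (not part of seaborn or SciPy) that returns the curve instead of plotting it, and the sample data are made up for illustration.

```python
import numpy as np
from scipy import stats

def fitted_pdf_curve(a, fit=stats.norm, n_points=200):
    """Hypothetical helper: fit `fit` to `a` and evaluate its PDF on a data-tied support."""
    a = np.asarray(a, dtype=float)
    # pad the data range by three Scott's-rule bandwidths, as in add_fit_to_histplot
    bw = len(a) ** (-1 / 5) * a.std(ddof=1)
    x = np.linspace(a.min() - 3 * bw, a.max() + 3 * bw, n_points)
    # fitted parameters; for stats.gamma these are (shape, loc, scale)
    params = fit.fit(a)
    y = fit.pdf(x, *params)
    return x, y, params

# illustrative sample data (an assumption, not from the original post)
sample = np.random.default_rng(0).gamma(shape=2.0, scale=1.5, size=500)
x_fit, y_fit, gamma_params = fitted_pdf_curve(sample, fit=stats.gamma)
print(gamma_params)
```

To draw it, something like ax = sns.histplot(sample, stat='density') followed by ax.plot(x_fit, y_fit) should overlay the fitted gamma curve on the histogram.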


This answer differs from the existing answers because it plots the fitted gaussian (or any other continuous distribution, e.g. gamma) only over the range where the histogram has data, which is also how distplot() plots the fit. The aim is to replicate distplot()'s fit functionality as closely as possible.

For example, suppose you have Poisson-distributed data and want to plot its histogram with a gaussian fit. With add_fit_to_histplot(), because the support is tied to the data endpoints (padded using a Scott's rule bandwidth), the fitted curve is drawn only where there is corresponding data on the histogram, which is also how distplot() draws it (the left subplot below). On the other hand, ohtotasche's normal() function draws the curve from mean - 4*std to mean + 4*std regardless of where the data lie, so the left tail of the normal pdf is drawn in full (the right subplot below).

data = np.random.default_rng(0).poisson(0.5, size=500)

fig, (a1, a2) = plt.subplots(1, 2, facecolor='white', figsize=(10,4))

# left subplot
sns.histplot(data, stat='density', color='#1f77b4', alpha=0.4, edgecolor='none', ax=a1)
add_fit_to_histplot(data, fit=stats.norm, ax=a1)
a1.set_title("With add_fit_to_histplot")

# right subplot
sns.histplot(x=data, stat="density", ax=a2)
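# normal() is defined in ohtotasche's answer; it calls plt.plot, so it draws on the current axes (a2 here)
normal(data.mean(), data.std())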
a2.set_title("With ohtotasche's normal function")

[Figure: difference]
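
The difference between the two supports can also be checked numerically. The following is a sketch only: data_tied_support and mean_centred_support are hypothetical names that reproduce the x ranges used by add_fit_to_histplot() and by normal() respectively, without plotting anything.

```python
import numpy as np

def data_tied_support(a, n_points=200):
    """x range used by add_fit_to_histplot: data endpoints padded by 3 bandwidths."""
    a = np.asarray(a, dtype=float)
    bw = len(a) ** (-1 / 5) * a.std(ddof=1)
    return np.linspace(a.min() - 3 * bw, a.max() + 3 * bw, n_points)

def mean_centred_support(a, n_points=200, width=4):
    """x range used by normal(): mean +/- `width` standard deviations."""
    a = np.asarray(a, dtype=float)
    return np.linspace(a.mean() - width * a.std(), a.mean() + width * a.std(), n_points)

data = np.random.default_rng(0).poisson(0.5, size=500)
tied = data_tied_support(data)
centred = mean_centred_support(data)

# the data-tied support starts just below the smallest observation,
# while the mean-centred one extends further left into a region with no data
print("data range:      ", data.min(), data.max())
print("data-tied range: ", tied[0], tied[-1])
print("mean-centred:    ", centred[0], centred[-1])
```

For this right-skewed sample the mean-centred range reaches further to the left than the data-tied one, which is why the left tail appears in one subplot and not in the other.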

Upvotes: 5

ohtotasche

Reputation: 508

I really miss the fit parameter too. It doesn't appear they replaced that functionality when they deprecated the distplot function. Until they plug that hole, I created a short function to add the normal distribution overlay to my histplot. I just paste the function at the top of a file along with the imports, and then I just have to add one line to add the overlay when I want it.

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

def normal(mean, std, color="black"):
    x = np.linspace(mean-4*std, mean+4*std, 200)
    p = stats.norm.pdf(x, mean, std)
    # draw on the current axes; pass the color by keyword rather than as a format string
    plt.plot(x, p, color=color, linewidth=2)

data = np.random.normal(size=500) * 0.1
ax = sns.histplot(x=data, stat="density")
normal(data.mean(), data.std())

If you would rather use stat="probability" instead of stat="density", you can rescale the fit curve so that its peak reaches a given height (here the top of the y-axis) with something like this:

def normal(mean, std, histmax=False, color="black"):
    x = np.linspace(mean-4*std, mean+4*std, 200)
    p = stats.norm.pdf(x, mean, std)
    if histmax:
        p = p * histmax / p.max()
    plt.plot(x, p, color=color, linewidth=2)

data = np.random.normal(size=500) * 0.1
ax = sns.histplot(x=data, stat="probability")
normal(data.mean(), data.std(), histmax=ax.get_ylim()[1])
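
Rescaling to the axis limit ties the height of the curve to the plot rather than to the data. An alternative sketch, assuming equal-width bins: with stat="probability" each bar height is the fraction of observations in that bin, which is approximately the density times the bin width, so the pdf can simply be multiplied by the bin width. Here probability_scaled_normal is a hypothetical helper name, and the bin count of 30 is an arbitrary choice.

```python
import numpy as np
from scipy import stats

def probability_scaled_normal(data, bins=30, n_points=200):
    """Hypothetical helper: normal pdf scaled to match stat="probability" bar heights."""
    data = np.asarray(data, dtype=float)
    # equal-width bin edges spanning the data
    edges = np.histogram_bin_edges(data, bins=bins)
    bin_width = edges[1] - edges[0]
    # fitted mean and standard deviation
    mean, std = stats.norm.fit(data)
    x = np.linspace(edges[0], edges[-1], n_points)
    # density * bin width approximates the probability mass per bin
    y = stats.norm.pdf(x, mean, std) * bin_width
    return x, y, edges

data = np.random.default_rng(0).normal(size=500) * 0.1
x, y, edges = probability_scaled_normal(data)

# bar heights as drawn with stat="probability": fraction of observations per bin
counts, _ = np.histogram(data, bins=edges)
probs = counts / len(data)
print(probs.max(), y.max())
```

To plot it, passing the same edges to the histogram, e.g. ax = sns.histplot(x=data, stat="probability", bins=edges) followed by ax.plot(x, y, color="black"), should keep the bars and the curve on the same scale.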

Upvotes: 12

Regi Mathew

Reputation: 2873

Sorry I'm late to the party. See whether this meets your requirement.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

data = np.random.normal(size=500) * 0.1
mu, std = norm.fit(data)

# Plot the histogram.
plt.hist(data, bins=25, density=True, alpha=0.6, color='g')

# Plot the PDF.
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mu, std)
plt.plot(x, p, 'k', linewidth=2)
plt.show()


Upvotes: 4
