quant

Reputation: 4482

range of x-axis of kdeplot in seaborn is different than in data

I am plotting a kdeplot using

import seaborn as sns
colors = ['r', 'g', 'b']
# dt['var'] rather than dt.var, because dt.var resolves to the DataFrame.var method
for color, v in zip(colors, dt['var'].unique()):
    p1 = sns.kdeplot(dt.query('var == @v')['val'], shade=True, color=color, legend=None).get_figure()

The dt.val.max() and the dt.val.min() are 350 and 0 respectively.

But the plot looks like this

enter image description here

I don't understand why the x-axis range is not in accordance with the data.

Upvotes: 3

Views: 2530

Answers (1)

JohanC

Reputation: 80329

A kde places a gaussian bell shape over each of the data points and sums all those shapes. The width of each bell depends on the number of points (unless it is given explicitly as a parameter) and on the variance of the data: the fewer sample points, the wider the bells. Because the bells centered on the smallest and largest values extend to both sides, the curve continues beyond the minimum and maximum of the data. Probably your red curve has few sample points, and most of them are close to 0 or 350.

Currently, seaborn uses statsmodels.nonparametric.kde.KDEUnivariate with the formula 1.059 * std(samples) * len(samples) ** (-1/5) for the bandwidth, i.e. the standard deviation of each gaussian bell.
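To get a feeling for how the number of points affects the width, here is a minimal sketch that only evaluates the formula quoted above. The helper name `rule_of_thumb_bandwidth` and the two data sets are made up for illustration, and the population standard deviation is an assumption; the library may compute the spread slightly differently.

```python
import statistics

def rule_of_thumb_bandwidth(samples):
    """Evaluate 1.059 * std(samples) * n ** (-1/5) for a list of numbers.

    Assumption: the population standard deviation is used for std().
    """
    n = len(samples)
    return 1.059 * statistics.pstdev(samples) * n ** (-1 / 5)

# Hypothetical data: identical spread (only 0 and 350), different sample counts.
few = [0, 350] * 5       # 10 points
many = [0, 350] * 500    # 1000 points

bw_few = rule_of_thumb_bandwidth(few)
bw_many = rule_of_thumb_bandwidth(many)

print(f"bandwidth with {len(few)} points:  {bw_few:.1f}")
print(f"bandwidth with {len(many)} points: {bw_many:.1f}")
```

With the same spread, the smaller sample gets the larger bandwidth, so its bells reach further past 0 and 350.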

In general, a kdeplot is meant for continuous distributions with enough sample points, assuming that the probability density function is rather smooth.

The following code tries to illustrate how the kde curve is calculated as the sum of the individual gaussian curves, starting from a simplified distribution of sample points. These sample points give rise to a kde curve resembling the red curve of the example.

from matplotlib import pyplot as plt
import numpy as np
import seaborn as sns
import scipy.stats as stats

# simplified distribution: a few distinct values, each repeated many times
values = [0, 200, 300, 350]
repeats = [100, 25, 35, 40]
samples = np.repeat(values, repeats)
sns.kdeplot(samples, shade=False, color='crimson', label='kdeplot')

# bandwidth of the individual gaussians, using the formula above
sigma = 1.059 * samples.std() * len(samples) ** (-1/5.)
x = np.linspace(-150, 500, 500)
for val, rep in zip(values, repeats):
    # one gaussian per distinct value, weighted by its relative frequency
    f = stats.norm.pdf(x, val, sigma)
    plt.plot(x, f * rep / len(samples), ls=':', label=f'value: {val} freq: {rep}')
plt.ylim(bottom=0)  # let the y-axis start at zero
plt.legend()
plt.show()

resulting plot
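As a rough numeric check of the same idea, the following sketch estimates how much of the area under such a summed curve lies outside the range of the data. It reuses the sample values and the bandwidth formula from the example; doing the calculation with the standard library's `statistics.NormalDist` instead of seaborn is a choice made here for illustration, so the numbers only approximate what a given seaborn version draws.

```python
from statistics import NormalDist, pstdev

# Sample points and frequencies taken from the example above.
values = [0, 200, 300, 350]
repeats = [100, 25, 35, 40]
samples = [v for v, r in zip(values, repeats) for _ in range(r)]
n = len(samples)

# Same bandwidth formula as in the example (population standard deviation).
sigma = 1.059 * pstdev(samples) * n ** (-1 / 5)

lo, hi = min(samples), max(samples)
# Each distinct value contributes a gaussian weighted by its relative frequency;
# the cdf gives the part of that gaussian lying left of lo or right of hi.
mass_below = sum(r / n * NormalDist(v, sigma).cdf(lo)
                 for v, r in zip(values, repeats))
mass_above = sum(r / n * (1 - NormalDist(v, sigma).cdf(hi))
                 for v, r in zip(values, repeats))

print(f"bandwidth: {sigma:.1f}")
print(f"area below {lo}: {mass_below:.1%}")
print(f"area above {hi}: {mass_above:.1%}")
```

A sizeable share of the area ends up left of the minimum and right of the maximum, which is why the x-axis of the plot extends beyond the data.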

Upvotes: 3
