Quetzalcoatl

Reputation: 2146

Python/Scipy kde fit, scaling

I have a Series in Python and I'd like to fit a density to its histogram. Question: is there a slick way to use the values from np.histogram() to achieve this result? (see Update below)

My current problem is that the kde fit I perform has (seemingly) unwanted kinks, as shown in the second plot below. I was hoping for a monotonically decreasing kde fit that matches the histogram in the first plot. My current code is below. Thanks in advance.

import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import gaussian_kde as kde

# df is a pandas DataFrame and var one of its column names (defined elsewhere)
df[var].hist()
plt.show()  # shows the original histogram
density = kde(df[var])
xs = np.arange(0, df[var].max(), 0.1)
ys = density(xs)
plt.plot(xs, ys)  # a pdf with kinks

Alternatively, is there a slick way to use

count, div = np.histogram(df[var])

and then scale the count array to apply kde() to it?

[Figure: original histogram]

[Figure: kde_fit]

Update

Based on cel's comment below (should've been obvious, but I missed it!), I was implicitly under-binning in this case using the default params in pandas.DataFrame.hist(). In the updated plot I used

df[var].hist(bins=100)

I'll leave this post up in case others find it useful but won't mind if it gets taken down as 'too localized' etc.

[Figure: updated histogram with bins=100]

Upvotes: 2

Views: 3307

Answers (2)

Quetzalcoatl

Reputation: 2146

The problem was under-binning, as cel pointed out in the comments above. Setting bins=100 in pd.DataFrame.hist(), which defaults to bins=10, made this clear.

See also: http://en.wikipedia.org/wiki/Histogram#Number_of_bins_and_width
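On the original question of reusing the np.histogram() output: one possible approach, sketched here under assumptions, is to pass the bin centres to gaussian_kde together with the counts as weights. This relies on the weights argument of gaussian_kde (SciPy 1.2 or later, as far as I know). The exponential sample and the names sample, centers and nonempty are made up for illustration and stand in for df[var].

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic, monotonically decaying sample (an assumption; stands in for df[var]).
rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=5000)

# Bin the data, then describe each bin by its centre and its count.
counts, edges = np.histogram(sample, bins=100)
centers = 0.5 * (edges[:-1] + edges[1:])
nonempty = counts > 0  # skip empty bins; they carry no weight anyway

# Reference fit on the raw values.
kde_raw = gaussian_kde(sample)

# Fit on bin centres weighted by counts. Reusing the raw fit's bandwidth
# factor keeps the two curves comparable.
kde_binned = gaussian_kde(
    centers[nonempty],
    weights=counts[nonempty],
    bw_method=kde_raw.factor,
)

xs = np.linspace(0, sample.max(), 200)
max_gap = np.max(np.abs(kde_binned(xs) - kde_raw(xs)))
print("largest difference between the two fits:", max_gap)
```

If bw_method is left out, the weighted fit picks its default bandwidth from an effective sample size computed from the weights, so it will usually come out smoother than the fit on the raw values. With very few bins (e.g. the default 10) the binned fit also loses detail, which is the same under-binning issue described above.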

Upvotes: 2

unutbu

Reputation: 879471

If you increase the bandwidth using the bw_method parameter, the kde will look smoother. This example comes from Justin Peel's answer; the code has been modified to pass bw_method:

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde

data = [1.5]*7 + [2.5]*2 + [3.5]*8 + [4.5]*3 + [5.5]*1 + [6.5]*8
density1 = gaussian_kde(data)
bandwidth = 1.5
density2 = gaussian_kde(data, bw_method=bandwidth)
xs = np.linspace(0,8,200)
plt.plot(xs,density1(xs), label='bw_method=None')
plt.plot(xs,density2(xs), label='bw_method={}'.format(bandwidth))
plt.legend(loc='best')
plt.show()

yields

[Figure: the two KDE curves, labeled bw_method=None and bw_method=1.5]
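To see why 1.5 counts as a large bandwidth here, it may help to compare the factor each fit ends up with. The sketch below reuses the data from the example above and reads the factor attribute of each gaussian_kde object. My understanding is that a scalar bw_method is used directly as that factor, while the default (Scott's rule) gives n**(-1/5) for one-dimensional data.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Same toy data as in the example above (29 points).
data = [1.5]*7 + [2.5]*2 + [3.5]*8 + [4.5]*3 + [5.5]*1 + [6.5]*8

density1 = gaussian_kde(data)                 # default: Scott's rule
density2 = gaussian_kde(data, bw_method=1.5)  # scalar: used as the factor

# The factor multiplies the sample standard deviation to give the kernel width.
print("default factor:", density1.factor)     # roughly 29 ** (-1/5)
print("manual factor: ", density2.factor)

# A wider kernel spreads the mass out, so the smoothed curve has a lower peak.
xs = np.linspace(0, 8, 200)
peak1 = density1(xs).max()
peak2 = density2(xs).max()
print("peak heights:", peak1, peak2)
```

So bw_method=1.5 is about three times the default factor for this sample, which is why the second curve loses the separate bumps. Values between the two would give intermediate amounts of smoothing.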

Upvotes: 2
