Reputation: 2146
I have a Series in Python and I'd like to fit a density to its histogram. Question: is there a slick way to use the values from np.histogram() to achieve this result? (see Update below)
My current problem is that the kde fit I perform has (seemingly) unwanted kinks, as depicted in the second plot below. I was hoping for a kde fit that is monotone decreasing based on a histogram, which is the first figure depicted. Below I've included my current code. Thanks in advance
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import gaussian_kde as kde
df[var].hist()
plt.show() # shows the original histogram
density = kde(df[var])
xs = np.arange(0, df[var].max(), 0.1)
ys = density(xs)
plt.plot(xs, ys) # a pdf with kinks
Alternatively, is there a slick way to use
count, div = np.histogram(df[var])
and then scale the count array to apply kde() to it?
Based on cel's comment below (should've been obvious, but I missed it!), I was implicitly under-binning in this case using the default params in pandas.DataFrame.hist(). In the updated plot I used
df[var].hist(bins=100)
I'll leave this post up in case others find it useful but won't mind if it gets taken down as 'too localized' etc.
Upvotes: 2
Views: 3307
Reputation: 2146
The problem was under-binning as mentioned by cel, see comments above. It was clarifying to set bins=100 in pd.DataFrame.histo() which defaults to bins=10.
See also: http://en.wikipedia.org/wiki/Histogram#Number_of_bins_and_width
Upvotes: 2
Reputation: 879471
If you increase the bandwidth using the bw_method
parameter, then the kde will look smoother. This example comes from Justin Peel's answer; the code has been modified to take advantage of the bw_method
:
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde
data = [1.5]*7 + [2.5]*2 + [3.5]*8 + [4.5]*3 + [5.5]*1 + [6.5]*8
density1 = gaussian_kde(data)
bandwidth = 1.5
density2 = gaussian_kde(data, bw_method=bandwidth)
xs = np.linspace(0,8,200)
plt.plot(xs,density1(xs), label='bw_method=None')
plt.plot(xs,density2(xs), label='bw_method={}'.format(bandwidth))
plt.legend(loc='best')
plt.show()
yields
Upvotes: 2