Adam

Reputation: 141

gaussian_kde with skewed distributions?

I need to do kernel density estimation on data that were generated from a lognormal distribution. I've been using gaussian_kde and plotting the data with matplotlib in Python.

However, the data are so extremely skewed that it's difficult to graph the density properly. In my example, most of the distribution is extremely close to 0, but because of the skew, the density estimate ends up spread much further along the x axis than it should be. I can get better resolution if I increase the bin size (the number of points in xs), but this takes an extremely long time.

Does anybody know any solutions to this? Does this require a different selection of bandwidth?

Here's some example code where I generated data:

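import numpy as np
from scipy.stats import gaussian_kde
from matplotlib.pyplot import plot, xlim

# Lognormal sample: exponentiate normal draws
k = np.random.normal(loc=-15, scale=6, size=10000)  # size must be an int, not 10e3
k = np.exp(k)
xs = np.linspace(min(k), max(k), 2500)  # evaluation grid over the full data range
density = gaussian_kde(k)  # uses the default (variance-based) bandwidth
d = density(xs)
plot(xs, d)
xlim(0, 5)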

The estimated density appears to be spread fairly evenly, and yet the median of k is virtually zero.

Does anybody have any solutions to this? Thanks!

Upvotes: 1

Views: 1721

Answers (1)

Josef

Reputation: 22897

Yes, the automatic bandwidth choice of gaussian_kde doesn't work in this case.

The bandwidth choice of gaussian_kde is based on an estimate of the variance. The variance is very large in this case because of a few very large observations. A better choice would be a variance estimate based on MAD, median absolute deviation.

>>> k.var()
20221.822015723094
>>> k.max()
12400.294578835219
>>> import statsmodels.api as sm
>>> sm.robust.scale.mad(k)
4.7202445521441112e-07

The default bandwidth in statsmodels is in this case based on MAD:

>>> kde = sm.nonparametric.KDEUnivariate(k)
>>> kde.fit()
>>> kde.bw     # selected default bandwidth
2.3879089644581783e-06

This would match the large concentration of observations close to zero. You can set the bandwidth in gaussian_kde instead of using the default bandwidth.
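As a rough sketch of setting the bandwidth by hand: a scalar bw_method in gaussian_kde is a factor that gets multiplied by the sample standard deviation, so an absolute bandwidth h has to be divided by k.std(ddof=1) first. The MAD-based rule of thumb below (the 1.06 constant and the name h) is an assumption for illustration, not the exact formula statsmodels uses.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
k = np.exp(rng.normal(loc=-15, scale=6, size=10_000))

# Robust scale estimate: MAD, rescaled so it matches the standard
# deviation for normally distributed data.
mad = np.median(np.abs(k - np.median(k))) / 0.6745

# Assumed rule-of-thumb absolute bandwidth (illustrative choice only).
h = 1.06 * mad * k.size ** (-1 / 5)

# gaussian_kde multiplies a scalar bw_method by the sample standard
# deviation, so convert the absolute bandwidth into that factor.
factor = h / k.std(ddof=1)
kde = gaussian_kde(k, bw_method=factor)

# Evaluate only near zero, where most of the observations are.
xs = np.linspace(0, 10 * np.median(k), 500)
d = kde(xs)
```

With this factor the kernel's standard deviation equals h, so the estimate resolves the mass close to zero instead of being smeared out by the few huge observations.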

This bandwidth, however, will be far too small for the tail, which has only a few observations separated by large distances. For that part, the bandwidth should be large.

Unfortunately, gaussian_kde cannot handle an adaptive bandwidth, and neither can the kernel density estimators in statsmodels.
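One possible workaround, which is a different technique from a true adaptive bandwidth, is to estimate the density of log(k) and transform it back with the change-of-variables formula f_K(x) = f_Y(log x) / x. A fixed bandwidth on the log scale then behaves like a bandwidth that grows with x on the original scale. This is only a sketch; the helper name density_original_scale is made up for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
k = np.exp(rng.normal(loc=-15, scale=6, size=10_000))

# KDE on the log scale, where the data are roughly normal and the
# default bandwidth is reasonable.
log_kde = gaussian_kde(np.log(k))

def density_original_scale(x):
    """Hypothetical helper: density of k at x > 0 via change of variables."""
    x = np.asarray(x, dtype=float)
    return log_kde(np.log(x)) / x

# A log-spaced grid resolves both the mass near zero and the long tail.
xs = np.logspace(np.log10(k.min()), np.log10(k.max()), 500)
d = density_original_scale(xs)
```

Plotting d against xs with a logarithmic x axis (for example matplotlib's semilogx) shows both the concentration near zero and the tail.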

Upvotes: 2
