Lizardinablizzard
Lizardinablizzard

Reputation: 55

Multivariate KDE Scipy Stats - what if it's not Gaussian?

I have some 2D data that I am smoothing using:

from scipy.stats import gaussian_kde
kde = gaussian_kde(data)

but what if my data isn't Gaussian/tophat/the other options? Mine looks more elliptical before smoothing, so should I really have a different bandwidth in x and then y? The variance in one direction is a lot higher, and also the values of the x axis are higher, so it feels like a simple Gaussian might miss something?

Upvotes: 0

Views: 2658

Answers (1)

Max Pierini
Max Pierini

Reputation: 2249

This is what I get with your defined X and Y. Seems good. Were you expecting something different?

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def generate(n):
    # generate data
    np.random.seed(42)
    x = np.random.normal(size=n, loc=1, scale=0.01)
    np.random.seed(1)
    y = np.random.normal(size=n, loc=200, scale=100)
    return x, y

x, y = generate(100)
xmin = x.min()
xmax = x.max()
ymin = y.min()
ymax = y.max()

X, Y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
positions = np.vstack([X.ravel(), Y.ravel()])
values = np.vstack([x, y])
kernel = stats.gaussian_kde(values)
Z = np.reshape(kernel(positions).T, X.shape)

fig, ax = plt.subplots(figsize=(7, 7))
ax.imshow(np.rot90(Z), cmap=plt.cm.gist_earth_r,
          extent=[xmin, xmax, ymin, ymax],
          aspect='auto', alpha=.75
         )
ax.plot(x, y, 'ko', ms=5)
ax.set_xlim([xmin, xmax])
ax.set_ylim([ymin, ymax])
plt.show()

enter image description here

The distributions of x and y are Gaussian. You can verify with seaborn too

import pandas as pd
import seaborn as sns
# I pass a DataFrame because passing
# (x,y) alone will be soon deprecated
g = sns.jointplot(data=pd.DataFrame({'x':x, 'y':y}), x='x', y='y')
g.plot_joint(sns.kdeplot, color="r", zorder=0, levels=6)

enter image description here


update

Kernel Density Estimate of 2-dimensional data is done separately along each axis and then join together.

Let's make an example with the dataset we already used.

As we can see in the seaborn jointplot, you have not only the estimated 2d-kde but also marginal distributions of x and y (the histograms).

enter image description here

So, step by step, let's estimate the density of x and y and then evaluate the density over a linearspace

kde_x = sps.gaussian_kde(x)
kde_x_space = np.linspace(x.min(), x.max(), 100)
kde_x_eval = kde_x.evaluate(kde_x_space)
kde_x_eval /= kde_x_eval.sum()

kde_y = sps.gaussian_kde(y)
kde_y_space = np.linspace(y.min(), y.max(), 100)
kde_y_eval = kde_y.evaluate(kde_y_space)
kde_y_eval /= kde_y_eval.sum()

fig, ax = plt.subplots(1, 2, figsize=(12, 4))
ax[0].plot(kde_x_space, kde_x_eval, 'k.')
ax[0].set(title='KDE of x')
ax[1].plot(kde_y_space, kde_y_eval, 'k.')
ax[1].set(title='KDE of y')
plt.show()

enter image description here

So we now have the marginal distributions of x and y. These are probability density functions so, the joint-probability of x and y can be seen as the intersection of independent events x and y, thus we can multiply the estimated probability density of x and y in a 2d-matrix and plot on 3d projection

# Grid of x and y
X, Y = np.meshgrid(kde_x_space, kde_y_space)
# Grid of probability density
kX, kY = np.meshgrid(kde_x_eval, kde_y_eval)
# Intersection
Z = kX * kY

fig, ax = plt.subplots(
    2, 2, 
    subplot_kw={"projection": "3d"}, 
    figsize=(10, 10))

for i, (elev, anim, title) in enumerate(zip([10, 10, 25, 25], 
                                            [0, -90, 25, -25],
                                            ['y axis', 'x axis', 'view 1', 'view 2']
                                            )):
    # Plot the surface.
    surf = ax.flat[i].plot_surface(X, Y, Z, cmap=plt.cm.gist_earth_r,
                           linewidth=0, antialiased=False, alpha=.75)
    ax.flat[i].scatter(x, y, zs=0, zdir='z', c='k')
    ax.flat[i].set(
        xlabel='x', ylabel='y',
        title=title
    )
    ax.flat[i].view_init(elev=elev, azim=anim)
plt.show()

enter image description here

This is a very simple and naif method but only to have an idea on how it works and why x and y scales don't matter for a 2d-KDE.

Upvotes: 1

Related Questions