Peanutlex

Reputation: 169

How to speed up the kernel density estimation using sklearn?

I am trying to estimate a probability density function (PDF) using sklearn.neighbors.KernelDensity. However, I don't know the optimum value to use for the bandwidth. Therefore, I am using sklearn.model_selection.GridSearchCV to calculate the optimum bandwidth (I got the idea from reading this).

My code is really slow, especially if I increase the number of data points above 1000, and ultimately I would like to work with on the order of 10^6 data points. Do you know how I could speed the code up, or is there another way to estimate the optimum bandwidth?

I have attached example code. Here the PDF is a Gaussian. However, in reality, I don't know the PDF and need to estimate it.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV, LeaveOneOut

mu = 0
sigma = 1
num_markers = int(1e3)
marker_positions = np.random.normal(mu, sigma, num_markers)

x_min = -3
x_max = 3
nx = 101
x = np.linspace(x_min, x_max, nx)

bandwidths = 10 ** np.linspace(-1, 0, 5)
grid = GridSearchCV(KernelDensity(kernel='gaussian'),
                    {'bandwidth': bandwidths},
                    cv=LeaveOneOut(),
                    n_jobs=-1,
                    verbose=1)
grid.fit(marker_positions[:, None])

kde = grid.best_estimator_.fit(marker_positions[:, None])
log_density = kde.score_samples(x[:, None])

fig, ax = plt.subplots()
ax.plot(x, np.exp(-x**2/2) / np.sqrt(2 * np.pi), label = 'Exact')
ax.plot(x, np.exp(log_density), label = 'Estimate')
ax.set_ylabel('Probability density')
ax.set_xlabel('x')
ax.legend()
plt.savefig('probability_density_estimate.png')

Here is the output:

Fitting 1000 folds for each of 5 candidates, totalling 5000 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    0.9s
[Parallel(n_jobs=-1)]: Done 257 tasks      | elapsed:    2.2s
[Parallel(n_jobs=-1)]: Done 757 tasks      | elapsed:    4.8s
[Parallel(n_jobs=-1)]: Done 1457 tasks      | elapsed:    8.7s
[Parallel(n_jobs=-1)]: Done 2357 tasks      | elapsed:   14.6s
[Parallel(n_jobs=-1)]: Done 3457 tasks      | elapsed:   21.4s
[Parallel(n_jobs=-1)]: Done 4757 tasks      | elapsed:   29.4s
[Parallel(n_jobs=-1)]: Done 4969 out of 5000 | elapsed:   30.7s remaining:    0.2s
[Parallel(n_jobs=-1)]: Done 5000 out of 5000 | elapsed:   30.8s finished

and the graph: [plot comparing the exact Gaussian PDF with the KDE estimate]

Upvotes: 2

Views: 1818

Answers (1)

realityChemist

Reputation: 325

There are a number of ways to choose a bandwidth. Cross-validation is one, but for a much faster approach there are some rules of thumb you can refer to. SciPy's implementation of Gaussian kernel density estimation (scipy.stats.gaussian_kde) has two such rules of thumb built in:

1. Scott's Rule: n**(-1. / (d + 4))

2. Silverman's Rule: (n * (d + 2) / 4.)**(-1. / (d + 4))

In both of these, d is the dimensionality of your data, and n is the number of data points you have. For weighted datasets, you can replace n in these equations with:

neff = sum(weights)**2 / sum(weights**2)

All of this comes directly from the linked page; there's a bit more detail there, and links to references. Hopefully one of these rules of thumb can be useful to you, or at least the linked page might be able to lead you down a fruitful path!
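As a minimal sketch (not part of the original answer), here is how Scott's rule could replace the grid search when fitting sklearn's KernelDensity. It assumes, as SciPy does, that the rule-of-thumb factor is applied as a multiple of the sample standard deviation to turn it into a bandwidth in data units (shown here for the 1-D case):

import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
data = rng.normal(0, 1, int(1e6))

n = data.size   # number of data points
d = 1           # dimensionality of the data

# Scott's factor from the rule above; applied as a multiple of the sample
# standard deviation (assumption following SciPy's convention) to get a
# bandwidth in data units.
scott_factor = n ** (-1.0 / (d + 4))
bandwidth = scott_factor * data.std(ddof=1)

kde = KernelDensity(kernel='gaussian', bandwidth=bandwidth).fit(data[:, None])

x = np.linspace(-3, 3, 101)
density = np.exp(kde.score_samples(x[:, None]))

Because the bandwidth comes from a closed-form rule, there is no cross-validation loop at all: the cost is a single fit and evaluation, which should scale comfortably to ~10^6 points.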

Upvotes: 2
