Peanutlex

Reputation: 169

How to speed up the kernel density estimation using sklearn?

I am trying to estimate a probability density function (PDF) using sklearn.neighbors.KernelDensity. However, I don't know the optimum value to use for the bandwidth. Therefore, I am using sklearn.model_selection.GridSearchCV to calculate the optimum bandwidth (I got the idea from reading this).

My code is really slow, especially if I increase the number of data points above 1000, and ultimately I would like to work with on the order of 10^6 data points. Do you know how I could speed the code up, or is there another way to estimate the optimum bandwidth?

I have attached example code. Here the PDF is a Gaussian. However, in reality, I don't know the PDF and need to estimate it.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV, LeaveOneOut

mu = 0
sigma = 1
num_markers = int(1e3)
marker_positions = np.random.normal(mu, sigma, num_markers)

x_min = -3
x_max = 3
nx = 101
x = np.linspace(x_min, x_max, nx)

bandwidths = 10 ** np.linspace(-1, 0, 5)
grid = GridSearchCV(KernelDensity(kernel='gaussian'),
                    {'bandwidth': bandwidths},
                    cv=LeaveOneOut(),
                    n_jobs=-1,
                    verbose=1)
grid.fit(marker_positions[:, None])

kde = grid.best_estimator_.fit(marker_positions[:, None])
log_density = kde.score_samples(x[:, None])

fig, ax = plt.subplots()
ax.plot(x, np.exp(-x**2/2) / np.sqrt(2 * np.pi), label = 'Exact')
ax.plot(x, np.exp(log_density), label = 'Estimate')
ax.set_ylabel('Probability density')
ax.set_xlabel('x')
ax.legend()
plt.savefig('probability_density_estimate.png')

Here is the output:

Fitting 1000 folds for each of 5 candidates, totalling 5000 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    0.9s
[Parallel(n_jobs=-1)]: Done 257 tasks      | elapsed:    2.2s
[Parallel(n_jobs=-1)]: Done 757 tasks      | elapsed:    4.8s
[Parallel(n_jobs=-1)]: Done 1457 tasks      | elapsed:    8.7s
[Parallel(n_jobs=-1)]: Done 2357 tasks      | elapsed:   14.6s
[Parallel(n_jobs=-1)]: Done 3457 tasks      | elapsed:   21.4s
[Parallel(n_jobs=-1)]: Done 4757 tasks      | elapsed:   29.4s
[Parallel(n_jobs=-1)]: Done 4969 out of 5000 | elapsed:   30.7s remaining:    0.2s
[Parallel(n_jobs=-1)]: Done 5000 out of 5000 | elapsed:   30.8s finished

and the graph: [plot comparing the exact Gaussian PDF with the KDE estimate]

Upvotes: 2

Views: 1818

Answers (1)

realityChemist

Reputation: 325

There are a number of ways to choose a bandwidth. Cross-validation is one, but for a much faster approach there are some rules of thumb you can refer to. SciPy's implementation of Gaussian kernel density estimation (scipy.stats.gaussian_kde) has two such rules of thumb built in:

1. Scott's Rule: n**(-1. / (d + 4))

2. Silverman's Rule: (n * (d + 2) / 4.)**(-1. / (d + 4))

In both of these, d is the dimensionality of your data, and n is the number of data points you have. For weighted datasets, you can replace n in these equations with:

neff = sum(weights)**2 / sum(weights**2)

All of this comes directly from the linked page; there's a bit more detail there, and links to references. Hopefully one of these rules of thumb can be useful to you, or at least the linked page might be able to lead you down a fruitful path!
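As a minimal sketch (not part of the original answer), here is how Scott's rule could replace the grid search when fitting sklearn's KernelDensity. It assumes, as SciPy does, that the rule-of-thumb factor is applied as a multiple of the sample standard deviation to turn it into a bandwidth in data units (shown here for the 1-D case):

import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
data = rng.normal(0, 1, int(1e6))

n = data.size   # number of data points
d = 1           # dimensionality of the data

# Scott's factor from the rule above; applied as a multiple of the sample
# standard deviation (assumption following SciPy's convention) to get a
# bandwidth in data units.
scott_factor = n ** (-1.0 / (d + 4))
bandwidth = scott_factor * data.std(ddof=1)

kde = KernelDensity(kernel='gaussian', bandwidth=bandwidth).fit(data[:, None])

x = np.linspace(-3, 3, 101)
density = np.exp(kde.score_samples(x[:, None]))

Because the bandwidth comes from a closed-form rule, there is no cross-validation loop at all: the cost is a single fit and evaluation, which should scale comfortably to ~10^6 points.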

Upvotes: 2
