I am trying to estimate a probability density function (PDF) using sklearn.neighbors.KernelDensity. However, I don't know the optimal value to use for the bandwidth, so I am using sklearn.model_selection.GridSearchCV to find it (I got the idea from reading this).
My code is really slow, especially if I increase the number of data points above 1000, and I would like the number of data points to be on the order of 10^6. Do you know how I could speed the code up? Or do you know another way to estimate the optimal bandwidth?
I have attached example code. Here the PDF is a Gaussian; in reality, however, I don't know the PDF and need to estimate it.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV, LeaveOneOut

# Sample data: draws from a standard normal distribution
mu = 0
sigma = 1
num_markers = int(1e3)
marker_positions = np.random.normal(mu, sigma, num_markers)

# Grid on which to evaluate the estimated PDF
x_min = -3
x_max = 3
nx = 101
x = np.linspace(x_min, x_max, nx)

# Candidate bandwidths, logarithmically spaced between 0.1 and 1
bandwidths = 10 ** np.linspace(-1, 0, 5)

# Leave-one-out cross-validation over the candidate bandwidths
grid = GridSearchCV(KernelDensity(kernel='gaussian'),
                    {'bandwidth': bandwidths},
                    cv=LeaveOneOut(),
                    n_jobs=-1,
                    verbose=1)
grid.fit(marker_positions[:, None])

# Refit the best estimator and evaluate the log-density on the grid
kde = grid.best_estimator_.fit(marker_positions[:, None])
log_density = kde.score_samples(x[:, None])

fig, ax = plt.subplots()
ax.plot(x, np.exp(-x**2 / 2) / np.sqrt(2 * np.pi), label='Exact')
ax.plot(x, np.exp(log_density), label='Estimate')
ax.set_ylabel('Probability density')
ax.set_xlabel('x')
ax.legend()
plt.savefig('probability_density_estimate.png')
Here is the output:
Fitting 1000 folds for each of 5 candidates, totalling 5000 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done 18 tasks | elapsed: 0.9s
[Parallel(n_jobs=-1)]: Done 257 tasks | elapsed: 2.2s
[Parallel(n_jobs=-1)]: Done 757 tasks | elapsed: 4.8s
[Parallel(n_jobs=-1)]: Done 1457 tasks | elapsed: 8.7s
[Parallel(n_jobs=-1)]: Done 2357 tasks | elapsed: 14.6s
[Parallel(n_jobs=-1)]: Done 3457 tasks | elapsed: 21.4s
[Parallel(n_jobs=-1)]: Done 4757 tasks | elapsed: 29.4s
[Parallel(n_jobs=-1)]: Done 4969 out of 5000 | elapsed: 30.7s remaining: 0.2s
[Parallel(n_jobs=-1)]: Done 5000 out of 5000 | elapsed: 30.8s finished
There are a number of ways to choose a bandwidth. Cross-validation is one, but for a much faster approach there are some rules of thumb you can refer to. SciPy's implementation of Gaussian kernel density estimation (scipy.stats.gaussian_kde) has two such rules of thumb built in:
1. Scott's Rule: n**(-1. / (d + 4))
2. Silverman's Rule: (n * (d + 2) / 4.)**(-1. / (d + 4)).
In both of these, d is the dimensionality of your data and n is the number of data points you have. For weighted datasets, you can replace n in these equations with:
neff = sum(weights)^2 / sum(weights^2)
All of this comes directly from the linked page; there's a bit more detail there, and links to references. Hopefully one of these rules of thumb can be useful to you, or at least the linked page might be able to lead you down a fruitful path!
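For reference, here is a minimal sketch of how these rules of thumb can be applied through scipy.stats.gaussian_kde (bw_method and the factor attribute are part of SciPy's API; the synthetic Gaussian data and the evaluation grid simply mirror the question):

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Rule-of-thumb bandwidths are cheap to compute even for ~1e6 points
data = rng.normal(0, 1, int(1e6))

# bw_method selects the rule of thumb: 'scott' (the default) or 'silverman'
kde = gaussian_kde(data, bw_method='silverman')

# The computed bandwidth factor, (n * (d + 2) / 4.)**(-1. / (d + 4)) for Silverman's rule
print(kde.factor)

# Evaluate the estimated PDF on a grid, as in the question
x = np.linspace(-3, 3, 101)
density = kde(x)

If you want to stay with sklearn.neighbors.KernelDensity, my understanding (worth checking against the SciPy docs) is that SciPy scales this factor by the sample standard deviation internally, so an absolute bandwidth for sklearn would be roughly kde.factor * data.std(ddof=1) for 1D data.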