bambo222

Reputation: 429

Sampling from a Computed Multivariate kernel density estimation

Say I have X and Y coordinates on a map and a non-parametric distribution of "hot zones" (e.g. degree of pollution on a geographic map positioned at X and Y coordinates). My input data are heat maps.

I want to train a machine learning model that learns what a "hot zone" looks like, but I don't have a lot of labeled examples. All "hot zones" look pretty similar, but may be in different parts of my standardized XY coordinate map.

I can calculate a multivariate KDE and plot the density maps accordingly. To generate synthetic labeled data, can I "reverse" the KDE and randomly generate new image files with observations that fall within my KDE's "dense" range?

Is there any way to do this in Python?

Upvotes: 2

Views: 2686

Answers (1)

sascha

Reputation: 33532

There are at least three high-quality kernel-density estimation implementations available for Python: statsmodels, scikit-learn, and scipy.

My personal ranking is statsmodels > scikit-learn > scipy (best to worst), but it will depend on your use-case.
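For completeness, scipy also supports sampling from a fitted KDE via `gaussian_kde.resample()`. A minimal sketch (note that scipy expects the data transposed relative to scikit-learn, i.e. shape `(n_dims, n_points)`):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Toy 2-D data: scipy's gaussian_kde expects shape (n_dims, n_points)
rng = np.random.default_rng(0)
data = rng.normal(loc=[[0.0], [5.0]], scale=1.0, size=(2, 200))

kde = gaussian_kde(data)           # bandwidth via Scott's rule by default
samples = kde.resample(size=1000)  # new draws, shape (2, 1000)
print(samples.shape)
```

scipy only offers rule-of-thumb bandwidths out of the box, which is one reason it ranks last here.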

Some random remarks:

  • scikit-learn offers sampling from a fitted KDE for free (kde.sample(N))
  • scikit-learn offers good cross-validation functions based on grid-search or random-search (cross-validation is highly recommended)
  • statsmodels offers cross-validation methods based on optimization (can be slow on big datasets, but very accurate)
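As a sketch of the statsmodels route (assuming 2-D continuous data; `bw='cv_ml'` selects the bandwidth by maximum-likelihood cross-validation, which is the slow-but-accurate option mentioned above):

```python
import numpy as np
from statsmodels.nonparametric.kernel_density import KDEMultivariate

# Toy 2-D data: statsmodels expects shape (n_points, n_dims)
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2))

# 'cc' marks both variables as continuous; bandwidth is chosen by
# maximum-likelihood cross-validation (optimization-based, can be slow)
kde = KDEMultivariate(data=data, var_type='cc', bw='cv_ml')
print(kde.bw)  # one fitted bandwidth per dimension
```

Note that statsmodels does not expose a built-in sampler, which is why the example below uses scikit-learn when new points need to be generated.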

There are many more differences, and some of them were analyzed in this very good blog post by Jake VanderPlas. The following table is an excerpt from that post:

From: https://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/ (Author: Jake VanderPlas)

Here is some example code using scikit-learn:

from sklearn.datasets import make_blobs
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
import numpy as np

# Create test-data
data_x, data_y = make_blobs(n_samples=100, n_features=2, centers=7, cluster_std=0.5, random_state=0)

# Fit KDE (cross-validation used!)
params = {'bandwidth': np.logspace(-1, 2, 30)}
grid = GridSearchCV(KernelDensity(), params)
grid.fit(data_x)
kde = grid.best_estimator_
bandwidth = grid.best_params_['bandwidth']  # selected bandwidth (handy to inspect)

# Resample
N_POINTS_RESAMPLE = 1000
resampled = kde.sample(N_POINTS_RESAMPLE)

# Plot original data vs. resampled
fig, axs = plt.subplots(2, 2, sharex=True, sharey=True)

# Original data: scatter (left) and hexbin density (right)
axs[0, 0].scatter(data_x[:, 0], data_x[:, 1], s=10)
axs[0, 1].hexbin(data_x[:, 0], data_x[:, 1], gridsize=20)

# Resampled data: scatter (left) and hexbin density (right)
axs[1, 0].scatter(resampled[:, 0], resampled[:, 1], s=10)
axs[1, 1].hexbin(resampled[:, 0], resampled[:, 1], gridsize=20)

plt.show()

Output:

(2×2 figure: original data on the top row, resampled data on the bottom; scatter plots on the left, hexbin densities on the right)

Upvotes: 7
