Reputation: 429
Say I have X and Y coordinates on a map and a non-parametric distribution of "hot zones" (e.g. degree of pollution on a geographic map positioned at X and Y coordinates). My input data are heat maps.
I want to train a machine learning model that learns what a "hot zone" looks like, but I don't have a lot of labeled examples. All "hot zones" look pretty similar, but may be in different parts of my standardized XY coordinate map.
I can calculate a multivariate KDE and plot the density maps accordingly. To generate synthetic labeled data, can I "reverse" the KDE and randomly generate new image files with observations that fall within my KDE's "dense" range?
Is there any way to do this in Python?
Upvotes: 2
Views: 2686
Reputation: 33532
There are at least 3 high-quality kernel-density estimation implementations available for Python: scipy, scikit-learn, and statsmodels.
My personal ranking is statsmodels > scikit-learn > scipy (best to worst), but it will depend on your use case.
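For a sense of what the scipy option looks like, here is a minimal sketch, assuming `scipy.stats.gaussian_kde` and made-up 2-D data (the `points` array and the sample sizes are illustrative only, not taken from the question). Note that scipy expects the data as `(n_dims, n_points)`, so the arrays are transposed on the way in and out.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Hypothetical observations: 200 points scattered around (1, 2)
points = rng.normal(loc=[1.0, 2.0], scale=0.3, size=(200, 2))

# gaussian_kde expects shape (n_dims, n_points), hence the transpose
kde = gaussian_kde(points.T)

# Draw 500 new points from the fitted density; resample returns (n_dims, size)
new_points = kde.resample(500).T
print(new_points.shape)
```

The bandwidth here is whatever scipy picks by default (Scott's rule); unlike the cross-validated search further down, nothing is tuned in this sketch.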
Some random remarks:
scikit-learn's KernelDensity can draw new samples from the fitted estimate (kde.sample(N)).
There are many more differences, and some of these were analyzed in this very good blog post by Jake VanderPlas. The following table is an excerpt from this post:
From: https://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/ (Author: Jake VanderPlas)
from sklearn.datasets import make_blobs
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
import numpy as np
# Create test-data
data_x, data_y = make_blobs(n_samples=100, n_features=2, centers=7, cluster_std=0.5, random_state=0)
# Fit KDE (cross-validation used!)
params = {'bandwidth': np.logspace(-1, 2, 30)}
grid = GridSearchCV(KernelDensity(), params)
grid.fit(data_x)
kde = grid.best_estimator_
bandwidth = grid.best_params_['bandwidth']
# Resample
N_POINTS_RESAMPLE = 1000
resampled = kde.sample(N_POINTS_RESAMPLE)
# Plot original data vs. resampled
fig, axs = plt.subplots(2, 2, sharex=True, sharey=True)
for i in range(100):
    axs[0,0].scatter(*data_x[i])
axs[0,1].hexbin(data_x[:, 0], data_x[:, 1], gridsize=20)
for i in range(N_POINTS_RESAMPLE):
    axs[1,0].scatter(*resampled[i])
axs[1,1].hexbin(resampled[:, 0], resampled[:, 1], gridsize=20)
plt.show()
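Since the question asks for new image-like inputs rather than point clouds, one possible follow-up is to bin each batch of sampled points onto a fixed grid. The sketch below is only an assumption about what such a step could look like: the helper `synthetic_heatmap`, the hand-picked bandwidth of 0.5, the 64x64 grid, and the padding of 2.0 around the data are all hypothetical choices, not part of the answer above.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KernelDensity

# Stand-in for real observations (same kind of test data as above)
data_x, _ = make_blobs(n_samples=100, n_features=2, centers=7,
                       cluster_std=0.5, random_state=0)

# Bandwidth fixed by hand for brevity (assumption); cross-validate in practice
kde = KernelDensity(bandwidth=0.5).fit(data_x)

# Fixed extent so every synthetic image shares the same coordinate frame
x_min, y_min = data_x.min(axis=0) - 2.0
x_max, y_max = data_x.max(axis=0) + 2.0
extent = [[x_min, x_max], [y_min, y_max]]


def synthetic_heatmap(kde, n_points, bins, extent, seed):
    """Sample n_points from the KDE and bin them into a (bins, bins) image."""
    pts = kde.sample(n_points, random_state=seed)
    hist, _, _ = np.histogram2d(pts[:, 0], pts[:, 1], bins=bins, range=extent)
    img = hist.T  # transpose so rows correspond to y, columns to x
    peak = img.max()
    return img / peak if peak > 0 else img  # scale to [0, 1]


# Five synthetic heat maps, each from a different random draw
images = [synthetic_heatmap(kde, 1000, 64, extent, seed) for seed in range(5)]
```

Each array in `images` could then be written to disk, for example with `matplotlib.pyplot.imsave`. Points sampled outside `extent` are simply dropped by `np.histogram2d`, which is why the extent is padded a little beyond the data.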
Upvotes: 7