Reputation: 444
I have a dataset with 4 numeric features and 1000 datapoints. The distribution of the values is unknown (numpy randint generates uniform ints, but this is just for the purpose of illustration). Given a new datapoint (4 numbers), I want to find the cumulative probability (a single number) of this specific datapoint.
import numpy as np
data = np.random.randint(1, 100, size=(1000, 4))
array([[28, 52, 91, 66],
       [78, 94, 95, 12],
       [60, 63, 43, 37],
       ...,
       [81, 68, 45, 46],
       [14, 38, 91, 46],
       [37, 51, 68, 97]])
new_data = np.random.randint(1, 100, size=(1, 4))
array([[75, 24, 39, 94]])
Scipy
It can estimate the PDF, but I do not know how to estimate the cumulative probability from it. Possible ways are a Monte Carlo simulation or integration (scipy.integrate.nquad), which is too slow for my case (see Integrate 2D kernel density estimate).
import scipy.stats
kde = scipy.stats.gaussian_kde(data.T)
kde.pdf(new_data)
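One direction that might work (sketch only, untested on my side): gaussian_kde also exposes an integrate_box method that integrates the fitted density over a hyper-rectangle, so integrating from minus infinity up to the query point should give a KDE-based CDF value.
# Sketch: KDE-based CDF via gaussian_kde.integrate_box, which integrates
# the fitted density over a hyper-rectangle (here from -inf to the point).
lower = np.full(4, -np.inf)
upper = new_data[0]
kde.integrate_box(lower, upper)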
Scikit-learn
Same as above; I do not know how to estimate the cumulative probability.
from sklearn.neighbors import KernelDensity
model = KernelDensity()
model.fit(data)
np.exp(model.score_samples(new_data))
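A rough Monte Carlo route with scikit-learn (sketch only, assuming the default Gaussian kernel, which supports sampling): draw samples from the fitted model and take the fraction that lies componentwise below the query point.
# Sketch: Monte Carlo CDF estimate from the fitted KernelDensity model.
samples = model.sample(100_000, random_state=0)      # shape (100000, 4)
(samples <= new_data[0]).all(axis=1).mean()          # fraction below the point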
Statsmodels
Cannot achieve anything here, as ECDF only accepts 1D data.
from statsmodels.distributions.empirical_distribution import ECDF
ecdf = ECDF(data[:, 0])
ecdf(new_data[0][0])
The question is: is there a fast and efficient way to estimate the cumulative probability of a 4-dimensional datapoint using the scipy or sklearn (preferably) models above?
Am I moving in the right direction, or is there a completely different way to solve this? Maybe variational autoencoders are the way to go? Are there simpler ways to solve this?
Upvotes: 2
Views: 1714
Reputation: 493
Doing some trial and error, I found the following:
A pure numpy solution (based on Josef's answer):
import numpy as np

def ecdf_mv(new_data, data):
    # Broadcast each query point against all observations:
    # rows has shape (m, 1, d), data has shape (n, d).
    rows = np.expand_dims(new_data, axis=1)
    # Fraction of observations that are below the query point in every
    # coordinate (note the strict "<" here; Josef's version uses "<=").
    ecdf = (data < rows).all(axis=2).mean(axis=1)
    return np.asarray(ecdf)
This will return the empirical cumulative distribution function for an array of points.
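For example, with the data and new_data arrays from the question (usage sketch; the exact values depend on the random draw):
ecdf_mv(new_data, data)    # shape (1,): ECDF value at the new point
ecdf_mv(data[:5], data)    # ECDF evaluated at the first five observations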
If instead the CDF over a multivariate KDE is desired, the following code can be used (this is much slower, though):
from statsmodels.nonparametric.kernel_density import KDEMultivariate

def cdf_kde_mv(new_data, data, data_type):
    # Fit a multivariate KDE and evaluate its CDF at the query points.
    data_kde = KDEMultivariate(data, var_type=data_type)
    return data_kde.cdf(new_data)
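With the question's four continuous features, var_type would be a string with one character per column, i.e. "cccc" (usage sketch):
cdf_kde_mv(new_data, data, "cccc")    # KDE-based CDF at the new point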
Performance-wise, the pure numpy version can be faster but is more memory-intensive, while Josef's approach with the for-loop is faster in some situations.
Regarding the "smoothness" feeling: it does not depend on "ECDF vs. CDF over a multivariate KDE" but rather on the sample size and the number of bins. In the example below, the visualization looks "smooth" even without using KDEs, because the sample size is large enough for that number of bins. The interpolation that reads as smoothing comes from matplotlib's plot_surface. If you also need the smooth results analytically and not only for the graph, consider using the KDE approach.
Applying this with some visualization:
import numpy as np
import matplotlib.pyplot as plt
sample_size = 10_000
bins_count = 20
sample = np.random.multivariate_normal([0, 0], [[1, 0.0], [0.0, 1]], sample_size)
bins_edges = np.linspace(-4, 4, bins_count + 1)
bins_centers = (bins_edges[:-1] + bins_edges[1:]) / 2
X, Y = np.meshgrid(bins_centers, bins_centers)
x, y = X.ravel(), Y.ravel()
evaluation_points = np.stack([x, y], axis=1)
cumulative_probability = cdf_kde_mv(evaluation_points, sample, "cc")
# Alternatively, the faster pure-numpy ECDF:
cumulative_probability = ecdf_mv(evaluation_points, sample)
cumulative_probability = cumulative_probability.reshape(bins_count, bins_count)
# Plotting
fig = plt.figure(figsize=(10, 10), constrained_layout=True)
ax = fig.add_subplot(projection='3d')
ax.view_init(elev=20., azim=-135)
ax.plot_surface(X, Y, cumulative_probability, rcount=bins_count, ccount=bins_count,
                antialiased=False, vmin=0, vmax=cumulative_probability.max(),
                alpha=0.5, cmap='viridis')
ax.margins(0)
ax.set_zlim(0, 1)
plt.show()
Upvotes: 1
Reputation: 22897
A multivariate ecdf at a point just computes the fraction of observations whose values are smaller than or equal to the point in every coordinate.
Something like the following:
np.random.seed(0)
data = np.random.randint(1, 100, size=(1000, 4))
new_data = np.random.randint(1, 100, size=(2, 4))
def ecdf_mv(new_data, data):
    new_data = np.atleast_2d(new_data)
    ecdf = []
    for row in new_data:
        # fraction of observations that are <= the point in every coordinate
        ecdf.append((data <= row).all(1).mean())
    return np.asarray(ecdf)
ecdf_mv(new_data, data)
array([0.039, 0.002])
Some checks:
# at the midpoint (50, 50, 50, 50); with 4 roughly independent uniform
# margins the expected value is about 0.5**4
ecdf_mv(np.ones(4) * 100 / 2, data), 0.5**4
(array([0.067]), 0.0625)
# each row fixes one coordinate at 50 and the rest at 100 (above the data
# maximum), so this recovers the four marginal ECDFs at 50, each near 0.5
marginal = 100 * np.ones((4, 4)) - 50 * np.eye(4)
ecdf_mv(marginal, data)
array([0.521, 0.515, 0.502, 0.54 ])
In the univariate case we can sort the data to get a fast algorithm to compute the ecdf at the original points; a sketch of that is shown below.
I don't know whether there is a data structure or algorithm that is computationally more efficient than the brute-force comparison when the ecdf has to be evaluated at many points.
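For illustration only (a hypothetical helper, not part of the original answer), the sorted univariate trick can be written with np.searchsorted:
# Sketch of the univariate case: sort once, then each ECDF evaluation
# is a binary search instead of a full pass over the data.
def ecdf_1d_sorted(points, data_1d):
    data_sorted = np.sort(data_1d)
    # number of observations <= each point, divided by the sample size
    return np.searchsorted(data_sorted, points, side="right") / data_1d.shape[0]

ecdf_1d_sorted(np.array([25, 50, 75]), data[:, 0])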
Upvotes: 3