Reputation: 444
I have a dataset with 4 numeric features and 1000 datapoints. The distribution of the values is unknown (numpy randint generates uniform ints, but this is just for the purpose of illustration). Given a new datapoint (4 numbers), I want to find the cumulative probability (a single number) of this specific datapoint.
import numpy as np
data = np.random.randint(1, 100, size=(1000, 4))
array([[28, 52, 91, 66],
       [78, 94, 95, 12],
       [60, 63, 43, 37],
       ...,
       [81, 68, 45, 46],
       [14, 38, 91, 46],
       [37, 51, 68, 97]])
new_data = np.random.randint(1, 100, size=(1, 4))
array([[75, 24, 39, 94]])
Scipy
It can estimate the PDF, but I do not know how to estimate the cumulative probability from it. Possible ways are a Monte Carlo simulation or integration (scipy.integrate.nquad), which is too slow for my case (see Integrate 2D kernel density estimate).
import scipy.stats
kde = scipy.stats.gaussian_kde(data.T)
kde.pdf(new_data)
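One direction that might work (sketch only, untested on my side): gaussian_kde also exposes an integrate_box method that integrates the fitted density over a hyper-rectangle, so integrating from minus infinity up to the query point should give a KDE-based CDF value.
# Sketch: KDE-based CDF via gaussian_kde.integrate_box, which integrates
# the fitted density over a hyper-rectangle (here from -inf to the point).
lower = np.full(4, -np.inf)
upper = new_data[0]
kde.integrate_box(lower, upper)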
Scikit-learn
Same as above; I do not know how to estimate the cumulative probability.
from sklearn.neighbors import KernelDensity
model = KernelDensity()
model.fit(data)
np.exp(model.score_samples(new_data))
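A rough Monte Carlo route with scikit-learn (sketch only, assuming the default Gaussian kernel, which supports sampling): draw samples from the fitted model and take the fraction that lies componentwise below the query point.
# Sketch: Monte Carlo CDF estimate from the fitted KernelDensity model.
samples = model.sample(100_000, random_state=0)      # shape (100000, 4)
(samples <= new_data[0]).all(axis=1).mean()          # fraction below the point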
Statsmodels
Cannot achieve anything here, as ECDF only accepts 1D data.
from statsmodels.distributions.empirical_distribution import ECDF
ecdf = ECDF(data[:, 0])
ecdf(new_data[0][0])
The question is: is there a fast and efficient way to estimate the cumulative probability of a 4-dimensional datapoint using the scipy or sklearn (preferably) models above?
Am I moving in the right direction, or is there a completely different way to solve this? Maybe variational autoencoders are the way to go? Are there simpler ways to solve this?
Upvotes: 2
Views: 1714
Reputation: 493
Doing some trial and error, I found the following:
A pure numpy solution (based on Josef's answer):
import numpy as np

def ecdf_mv(new_data, data):
    # Broadcast each query point against all observations:
    # rows has shape (m, 1, d), data has shape (n, d).
    rows = np.expand_dims(new_data, axis=1)
    # Fraction of observations that are below the query point in every
    # coordinate (note the strict "<" here; Josef's version uses "<=").
    ecdf = (data < rows).all(axis=2).mean(axis=1)
    return np.asarray(ecdf)
This will return the empirical cumulative distribution function for an array of points.
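For example, with the data and new_data arrays from the question (usage sketch; the exact values depend on the random draw):
ecdf_mv(new_data, data)    # shape (1,): ECDF value at the new point
ecdf_mv(data[:5], data)    # ECDF evaluated at the first five observations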
If instead the CDF over a multivariate KDE is desired, the following code can be used (this is much slower, though):
from statsmodels.nonparametric.kernel_density import KDEMultivariate

def cdf_kde_mv(new_data, data, data_type):
    # Fit a multivariate KDE and evaluate its CDF at the query points.
    data_kde = KDEMultivariate(data, var_type=data_type)
    return data_kde.cdf(new_data)
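With the question's four continuous features, var_type would be a string with one character per column, i.e. "cccc" (usage sketch):
cdf_kde_mv(new_data, data, "cccc")    # KDE-based CDF at the new point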
Performance-wise, the pure numpy version can be faster but is more memory-intensive, while Josef's approach with the for-loop is faster in some situations.
Regarding the "smoothness" feeling: it does not depend on "ECDF vs. CDF over a multivariate KDE" but rather on the sample size and the number of bins. In the example below, the visualization looks "smooth" even without using KDEs, because the sample size is large enough for that number of bins. The interpolation that reads as smoothing comes from matplotlib's plot_surface. If you also need the smooth results analytically and not only for the graph, consider using the KDE approach.
Applying this with some visualization:
import numpy as np
import matplotlib.pyplot as plt
sample_size = 10_000
bins_count = 20
sample = np.random.multivariate_normal([0, 0], [[1, 0.0], [0.0, 1]], sample_size)
bins_edges = np.linspace(-4, 4, bins_count + 1)
bins_centers = (bins_edges[:-1] + bins_edges[1:]) / 2
X, Y = np.meshgrid(bins_centers, bins_centers)
x, y = X.ravel(), Y.ravel()
evaluation_points = np.stack([x, y], axis=1)
cumulative_probability = cdf_kde_mv(evaluation_points, sample, "cc")
# Alternatively, the faster pure-numpy ECDF:
cumulative_probability = ecdf_mv(evaluation_points, sample)
cumulative_probability = cumulative_probability.reshape(bins_count, bins_count)
# Plotting
fig = plt.figure(figsize=(10, 10), constrained_layout=True)
ax = fig.add_subplot(projection='3d')
ax.view_init(elev=20., azim=-135)
ax.plot_surface(X, Y, cumulative_probability, rcount=bins_count, ccount=bins_count,
                antialiased=False, vmin=0, vmax=cumulative_probability.max(),
                alpha=0.5, cmap='viridis')
ax.margins(0)
ax.set_zlim(0, 1)
plt.show()
Upvotes: 1
Reputation: 22897
A multivariate ecdf at a point just computes the fraction of observations whose values are smaller than or equal to the point in every coordinate.
Something like the following:
np.random.seed(0)
data = np.random.randint(1, 100, size=(1000, 4))
new_data = np.random.randint(1, 100, size=(2, 4))
def ecdf_mv(new_data, data):
    new_data = np.atleast_2d(new_data)
    ecdf = []
    for row in new_data:
        # fraction of observations that are <= the point in every coordinate
        ecdf.append((data <= row).all(1).mean())
    return np.asarray(ecdf)
ecdf_mv(new_data, data)
array([0.039, 0.002])
Some checks:
# at the midpoint (50, 50, 50, 50); with 4 roughly independent uniform
# margins the expected value is about 0.5**4
ecdf_mv(np.ones(4) * 100 / 2, data), 0.5**4
(array([0.067]), 0.0625)
# each row fixes one coordinate at 50 and the rest at 100 (above the data
# maximum), so this recovers the four marginal ECDFs at 50, each near 0.5
marginal = 100 * np.ones((4, 4)) - 50 * np.eye(4)
ecdf_mv(marginal, data)
array([0.521, 0.515, 0.502, 0.54 ])
In the univariate case we can sort the data to get a fast algorithm to compute the ecdf at the original points; a sketch of that is shown below.
I don't know whether there is a data structure or algorithm that is computationally more efficient than the brute-force comparison when the ecdf has to be evaluated at many points.
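For illustration only (a hypothetical helper, not part of the original answer), the sorted univariate trick can be written with np.searchsorted:
# Sketch of the univariate case: sort once, then each ECDF evaluation
# is a binary search instead of a full pass over the data.
def ecdf_1d_sorted(points, data_1d):
    data_sorted = np.sort(data_1d)
    # number of observations <= each point, divided by the sample size
    return np.searchsorted(data_sorted, points, side="right") / data_1d.shape[0]

ecdf_1d_sorted(np.array([25, 50, 75]), data[:, 0])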
Upvotes: 3