deepdebugging

Reputation: 115

Generating random numbers for testing a kernel density estimate

I have a one dimensional dataset from which I have calculated the kernel density estimate, using statsmodels and also with scipy, in python.

I want to generate a set of random numbers to see if they give me the same distribution or not.

The solution I could find so far was this:

Draw a number x uniformly between 0 and 1 and return cdf^{-1}(x), where cdf^{-1} is the inverse cumulative distribution function of the estimated density f.

The problem with this is that, using statsmodels, I can get the inverse cumulative distribution, but only as an array of values evaluated on a grid. So multiplying a number x by that array does not give me a random sample from the distribution of the calculated KDE.

How can I generate a random number from the given KDE function or distribution?

EDIT: Here is a code sample:

import numpy as np
import statsmodels.api as sm

def calcKDE(data):
    # calculate the KDE
    kde = sm.nonparametric.KDEUnivariate(data)
    kde.fit(kernel='gau', bw='silverman')
    return kde

def generateData(data, kde):
    inverse_cdf = kde.icdf  # this takes no parameters, and so does kde.cdf
    randomSet = np.random.random(1) * inverse_cdf  # inverse_cdf comes back as an array,
                                                   # will also add a loop here to return
                                                   # 1000 random values
    return randomSet

Upvotes: 3

Views: 3326

Answers (1)

sascha

Reputation: 33532

The inverse-transform-sampling approach looks like this:

from scipy.optimize import brentq
import statsmodels.api as sm
import numpy as np

# fit (you already did this); KDEMultivariate needs the data and a
# var_type string -- 'c' marks a continuous variable
data = np.random.normal(size=1000)  # placeholder data for a runnable example
kde = sm.nonparametric.KDEMultivariate(data, var_type='c')

# sample a uniform number in [0, 1)
u = np.random.random()

# 1-d root-finding: find x with cdf(x) = u
def func(x):
    return kde.cdf([x])[0] - u  # cdf returns a length-1 array; extract the scalar
sample_x = brentq(func, -99999999, 99999999)  # read the brentq docs about these constants:
                                              # func must change sign between the endpoints

I recommend wrapping this in a class with a customized sample function for easier use.
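A minimal sketch of such a wrapper (the class name KDESampler, its bracket defaults, and the sample method are my own illustration, not a statsmodels API):

from scipy.optimize import brentq
import statsmodels.api as sm
import numpy as np

class KDESampler:
    # Inverse-transform sampler around a fitted statsmodels KDE.
    # lo and hi must bracket essentially all of the probability mass,
    # i.e. the CDF residual must change sign between them (see brentq docs).
    def __init__(self, kde, lo=-99999999, hi=99999999):
        self.kde = kde
        self.lo = lo
        self.hi = hi

    def sample(self, n=1):
        samples = np.empty(n)
        for i in range(n):
            u = np.random.random()
            # invert the CDF by root-finding: solve cdf(x) - u = 0
            samples[i] = brentq(lambda x: self.kde.cdf([x])[0] - u,
                                self.lo, self.hi)
        return samples

Usage would then be something like draws = KDESampler(kde).sample(1000) for a kde fitted as above.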

This approach is also very generic, but there are much faster approaches that use additional information about the kernels in use (avoiding inverse-transform sampling); one example follows below, and you should find more by googling.
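For example, with a Gaussian kernel the KDE is just an equal-weight mixture of Gaussians centered at the data points, so you can sample it directly. A minimal sketch (bw is assumed to be the fitted scalar bandwidth, e.g. kde.bw from the question's KDEUnivariate fit):

import numpy as np

def sample_gaussian_kde(data, bw, n=1000, rng=None):
    # Pick n data points uniformly at random, then add Gaussian noise
    # with standard deviation bw -- one draw per mixture component.
    rng = np.random.default_rng() if rng is None else rng
    centers = rng.choice(data, size=n, replace=True)
    return centers + rng.normal(scale=bw, size=n)

This avoids root-finding entirely and is much faster for large sample counts.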

Another remark:

My ordering of kde-tools from best to worst (my opinion):

  • statsmodels (nice features like dependent-variables; optimization-based CV)
  • scikit-learn (easy to use; sample-function, see the sketch after this list; grid- and random-search based CV)
  • scipy (don't like it much, but fft-based approaches for some usages -> fast)
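For reference, scikit-learn's KernelDensity ships a built-in sample method, so no manual inversion is needed there. A minimal sketch with placeholder data and an arbitrary bandwidth:

import numpy as np
from sklearn.neighbors import KernelDensity

data = np.random.normal(size=1000)          # placeholder 1-d data
kde = KernelDensity(kernel='gaussian', bandwidth=0.5)
kde.fit(data.reshape(-1, 1))                # sklearn expects a 2-d array
draws = kde.sample(n_samples=1000).ravel()  # direct sampling, no root-finding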

Upvotes: 6
