GeauxEric

Reputation: 3070

Use scipy.stats to automatically fit distributions and use the fitted parameters in the pdf calculation

I would like my program to automatically choose the best-fitting distribution and use that distribution's probability density function to calculate the probability. The steps I have in mind are:

  1. Use scipy.stats.rv_continuous.fit to get the fitted parameters, e.g.

    paras = scipy.stats.norm.fit(data_array)

  2. Use scipy.stats.kstest to test the goodness of fit

    fitness = scipy.stats.kstest(data_array, 'norm', args=paras)

  3. Choose the distribution that gives the lowest kstest statistic

  4. Calculate the probability, e.g.

    scipy.stats.norm.pdf(my_values, paras)

I am not sure whether this is a rigorously correct way to choose the best-fit distribution. Currently it works well for the normal distribution.

My problem is how to pass the arguments to scipy.stats.rv_continuous.pdf(). For some distributions, scipy.stats.rv_continuous.fit() returns three parameters: the shape, loc, and scale. I tried to pass them directly, like

scipy.stats.rv_continuous.pdf(my_values, paras[0], paras[1], paras[2])

This gives me two pdf values for one point.

I also tried to pass them in this way

scipy.stats.rv_continuous.pdf(my_values, paras[0], paras[1], paras[2])

But the outcome is weird. Has anybody tried to do something like this and run into the same kind of problem?

My goal is to replace the Gaussian in naive Bayes classification with a better-fitting distribution, in the hope of improving prediction accuracy.

Upvotes: 4

Views: 1817

Answers (1)

Matt Haberland

Reputation: 3873

My problem is how to pass the arguments to scipy.stats.rv_continuous.pdf()

Interpreting this literally, it sounds like you are trying to use the pdf method of the scipy.stats.rv_continuous class, but the rv_continuous class must be subclassed and instantiated before its pdf method can be used.
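
As a minimal sketch of what subclassing and instantiating looks like (the `Ramp` class and its density 2x on [0, 1] are made up purely for illustration and are not part of SciPy):

```python
from scipy import stats


class Ramp(stats.rv_continuous):
    """Hypothetical distribution with density f(x) = 2x on [0, 1]."""

    def _pdf(self, x):
        # Only the density on the support needs to be defined here;
        # the public pdf() returns 0 outside [a, b].
        return 2.0 * x


# Instantiate with the support [a, b] before calling pdf().
ramp = Ramp(a=0.0, b=1.0, name="ramp")

inside = float(ramp.pdf(0.5))   # density at a point inside the support
outside = float(ramp.pdf(2.0))  # density at a point outside the support
print(inside, outside)
```

The built-in distributions such as `stats.norm` are already instances of such subclasses, which is why their `pdf` method can be called directly.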

For the rest, I'm assuming you're using rv_continuous as a variable that refers to a SciPy distribution, e.g. rv_continuous = stats.norm. You may want to skip to the code at the end, but I will address each of the statements that indicates a problem first.

I tried to pass them directly, like scipy.stats.rv_continuous.pdf(my_values, paras[0], paras[1], paras[2]). This gives me two pdf values for one point.

It is difficult to debug this without knowing what distribution rv_continuous refers to and what my_values is. If rv_continuous is a variable that refers to a SciPy distribution, my_values is a scalar, paras is the output of rv_continuous.fit, and rv_continuous has three parameters (including loc and scale), there will only be one output, so the problem must lie in information not included here.
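
One possible cause, assuming the actual call resembled step 4 of the question (scipy.stats.norm.pdf(my_values, paras)), is that the whole tuple was passed as a single argument. SciPy then treats the tuple as an array of loc values and broadcasts, returning one density per element. A small sketch with made-up parameter values:

```python
import numpy as np
from scipy import stats

# Made-up (loc, scale) values standing in for the output of norm.fit.
params = (0.1, 0.9)

# Tuple passed as ONE argument: interpreted as loc=[0.1, 0.9], scale=1.
wrong = stats.norm.pdf(1.0, params)

# Tuple unpacked with `*`: loc=0.1, scale=0.9.
right = stats.norm.pdf(1.0, *params)

print(np.shape(wrong))  # shape of the result when the tuple is not unpacked
print(np.shape(right))  # shape of the result when the tuple is unpacked
```

This is only a guess at what happened; the question does not show the exact call that produced two values.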

I also tried to pass them in this way scipy.stats.rv_continuous.pdf(my_values, paras[0], paras[1], paras[2]) But the outcome is weird.

Since this is identical to the previous way, we would expect it to have the same behavior. Please consider elaborating on what the output is because "weird" can mean many things.

In any case, it sounds like this code will help. For each of two different distributions, it fits the distribution to data, creates a frozen distribution from the fitted parameters, and computes the PDF at a point. The two distributions have different numbers of parameters, so you can see that the code works regardless of how many parameters the distribution has.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng()
    data = rng.normal(size=1000)

    for family_name in ['norm', 'skewnorm']:
        family = getattr(stats, family_name)
        params = family.fit(data)
        dist = family(*params)  # note use of `*` to automatically unpack `params`
        print(f"{family_name}{params}.pdf(1): {dist.pdf(1)}")

    # norm(-0.004263933560864075, 0.9864850754623957).pdf(1): 0.2408655741640401
    # skewnorm(-0.9815681271426395, 0.660439450142722, 1.1895346103612483).pdf(1): 0.25093359040244
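
The same pattern extends to the selection procedure described in the question. The sketch below is only an illustration under assumptions: the candidate list and the seeded synthetic data are arbitrary choices, and comparing KS statistics is the question's heuristic rather than a rigorous model-selection method (the KS p-value in particular is not reliable when the parameters were fitted to the same data).

```python
import numpy as np
from scipy import stats

# Synthetic data; the seed and the normal parameters are arbitrary.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=1000)

# Arbitrary candidate families, chosen only for illustration.
candidates = ["norm", "logistic", "skewnorm"]

results = {}
for name in candidates:
    family = getattr(stats, name)
    params = family.fit(data)  # (shape params..., loc, scale)
    # kstest takes the distribution name and its parameters via `args`.
    ks = stats.kstest(data, name, args=params)
    results[name] = (ks.statistic, params)

# Pick the family with the smallest KS statistic (heuristic).
best_name = min(results, key=lambda n: results[n][0])
best_params = results[best_name][1]

# Freeze the best distribution and evaluate its PDF at new points.
best_dist = getattr(stats, best_name)(*best_params)
density = best_dist.pdf([1.0, 2.0, 3.0])
print(best_name, density)
```

Freezing the distribution with `*best_params` means the later `pdf` call does not need to know how many parameters the chosen family has.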

Upvotes: 0
