Reputation: 3070
I would like my program to automatically choose the distribution that fits my data best and use that distribution's probability density function to calculate the probability.
Use scipy.stats.rv_continuous.fit to get the fitted parameters, e.g.
paras = scipy.stats.norm.fit(data_array)
Use scipy.stats.kstest to test the goodness of fit
fitness = scipy.stats.kstest(data_array, paras)
Choose the distribution that gives the lowest kstest score
Calculate the probability, e.g.
scipy.stats.norm.pdf(my_values, paras)
I am not sure whether this is a rigorously correct way to choose the best-fit distribution. Currently it works well for normal distribution.
My problem is how to pass the arguments to scipy.stats.rv_continuous.pdf(). For some distributions, scipy.stats.rv_continuous.fit() returns three parameters: the shape, loc and scale. I tried passing them directly like
scipy.stats.rv_continuous.pdf(my_values, paras[0], paras[1], paras[2])
this will give me two values for pdf for one point.
I also tried passing them this way
scipy.stats.rv_continuous.pdf(my_values, paras[0], paras[1], paras[2])
But the outcome is weird. Has anybody tried to do something like this and run into the same kind of problem?
My goal is to replace the Gaussian with a better-fitting distribution in Naive Bayes classification, in the hope of improving prediction accuracy.
Upvotes: 4
Views: 1817
Reputation: 3873
My problem is how to pass the arguments to scipy.stats.rv_continuous.pdf()
Interpreting this literally, it sounds like you are trying to use the pdf method of the scipy.stats.rv_continuous class, but the rv_continuous class must be subclassed and instantiated before its pdf method can be used.
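To illustrate what "subclassed and instantiated" means, here is a minimal sketch. The class name LinearDensity and its density f(x) = 2x on [0, 1] are made up for the example and are not from the question; only the rv_continuous base class and its _pdf hook come from SciPy.

```python
from scipy import stats


class LinearDensity(stats.rv_continuous):
    """Hypothetical distribution with density f(x) = 2x on [0, 1]."""

    def _pdf(self, x):
        # SciPy calls _pdf only for points inside the support [a, b].
        return 2.0 * x


# Instantiate the subclass; a and b set the support of the distribution.
linear = LinearDensity(a=0.0, b=1.0, name="linear_density")

inside = linear.pdf(0.5)   # density inside the support
outside = linear.pdf(2.0)  # density outside the support is zero
print(inside, outside)
```

The built-in distributions such as stats.norm are already instances of such subclasses, which is why their pdf method can be called directly.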
For the rest, I'm assuming you're using rv_continuous as a variable that refers to a SciPy distribution, e.g. rv_continuous = stats.norm. You may want to skip to the code at the end, but I will first address each of the statements that indicates a problem.
I tried passing them directly like
scipy.stats.rv_continuous.pdf(my_values, paras[0], paras[1], paras[2])
this will give me two values for pdf for one point.
It is difficult to debug this without knowing what distribution rv_continuous refers to and what my_values is. If rv_continuous is a variable that refers to a SciPy distribution, if my_values is a scalar, if paras is the output of rv_continuous.fit, and if rv_continuous has three parameters (including loc and scale), there will only be one output, so the problem must lie in information not included here.
I also tried passing them this way
scipy.stats.rv_continuous.pdf(my_values, paras[0], paras[1], paras[2])
But the outcome is weird.
Since this is identical to the previous way, we would expect it to have the same behavior. Please consider elaborating on what the output is because "weird" can mean many things.
In any case, it sounds like this code will help. For each of two different distributions, it fits the distribution to data, creates a frozen distribution from the fitted parameters, and computes the PDF at a point. The two distributions have different numbers of parameters, so you can see that the code works regardless of how many parameters the distribution has.
import numpy as np
from scipy import stats

rng = np.random.default_rng()
data = rng.normal(size=1000)
for family_name in ['norm', 'skewnorm']:
    family = getattr(stats, family_name)
    params = family.fit(data)
    dist = family(*params)  # note use of `*` to automatically unpack `params`
    print(f"{family_name}{params}.pdf(1): {dist.pdf(1)}")
# norm(-0.004263933560864075, 0.9864850754623957).pdf(1): 0.2408655741640401
# skewnorm(-0.9815681271426395, 0.660439450142722, 1.1895346103612483).pdf(1): 0.25093359040244
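If you also want the automatic selection step from the question, the same pattern extends to it. The following is a rough sketch, not a rigorous model-selection procedure: the candidate list is arbitrary, the data are simulated, and the KS statistic is used only to rank the fits. Note that kstest takes the distribution name (or a CDF callable) plus args=params, not the parameter tuple alone, and that its p-value is not valid when the parameters were estimated from the same data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(size=1000)

# Hypothetical candidate list; use whichever families suit your data.
candidates = ['norm', 'skewnorm', 'expon']

results = {}
for name in candidates:
    family = getattr(stats, name)
    params = family.fit(data)
    # Compare the data with the fitted CDF; a smaller statistic is a closer fit.
    ks = stats.kstest(data, name, args=params)
    results[name] = (ks.statistic, params)

# Pick the family with the smallest KS statistic and freeze it.
best_name = min(results, key=lambda n: results[n][0])
best_dist = getattr(stats, best_name)(*results[best_name][1])

density = best_dist.pdf(1.0)
print(best_name, density)
```

Other ranking criteria, such as the log-likelihood of the fitted distribution, may be more appropriate depending on what you want from the classifier.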
Upvotes: 0