Reputation: 1632
I have a probability density function of an unknown distribution which is given as a set of tuples (x, f(x)), where x=numpy.arange(0,1,size)
and f(x) is the corresponding probability.
What is the best way to identify the corresponding distribution? So far my idea is to draw a large amount of samples based on the pdf (by writing the code myself from scratch) and then use the obtained data to fit all of the distributions implemented in scipy.stats, then take the best fit.
Is there a better way to solve this problem? For example, is there some kind of utility in scipy.stats that I'm missing that would help me solve this problem?
Upvotes: 4
Views: 844
Reputation: 76406
In a fundamental sense, it's not really possible to summarize a distribution based on empirical samples - see here a discussion.
It's possible to do something more limited, which is to reject/accept the hypothesis that it comes out of one of a finite set of (parametric) distributions, based on a somewhat arbitrary criterion.
Given the finite set of distributions, for each distribution, you could perhaps realistically do the following:
Fit the distribution's parameters to the data. E.g., scipy.stats.beta.fit
will fit the best parameters of the Beta distribution (all scipy
distributions have this method).
Reject/accept the hypothesis that the data was generated by this distribution. There is more than a single way of doing this. A particularly simple way is to use the rvs()
method of the distribution to generate another sample, then use ks_2samp
to generate a Kolmogorov-Smirnov test.
Note that some specific distributions might have better, ad-hoc algorithms for testing whether a member of the distribution's family generated the data. As usual, the Normal distribution has many in particular (see Normalcy test).
Upvotes: 4