Reputation: 21
could someone please help with a questions around the parametrization of scipy distributions and how to transform them?
I basically would like to recover distribution parameters of data that I simulate with numpy...
some_data = np.random.normal(loc=81, scale=7, size=100000)
...by fitting a distribution with scipy
recovered_parms = scipy.stats.norm.fit(some_data)
For the normal distribution, this works. recovered_parms ~= (81,7)
However, for e.g. a wald distribution it does not.
some_data = np.random.wald(mean=4, scale=41, size=100000)
recovered_parms = scipy.stats.wald.fit(some_data)
Result: recovered_parms ~= (1.28,3.66)
I understand that they need to be transformed but just can't figure out how. Any help appreciated.
Upvotes: 1
Views: 839
Reputation: 16460
If the problem is to just estimate the lambda and mean of the wald distribution. You can just do
mean = np.mean(some_data)
lambda_ = 1/(np.mean(1/some_data) - 1/mean) # lambda is a reserved keyword :/
This estimate seems to be pretty close than whatever the scipy.stats.wald fit is returning (if we interpret one of them as mean or we know how to interpret it)
Upvotes: 1
Reputation: 114911
numpy.random.wald
has two parameters, mean
and scale
.
scale
is, as the name suggests, a scale parameter, in the sense
of a location-scale family.
mean
is a shape parameter; it is not a location parameter.
If you look at the docstring for numpy.random.wald
, it says
"Draw samples from a Wald, or inverse Gaussian, distribution."
The docstring for scipy.stats.wald
, however, says that it is
"a special case of invgauss
with mu == 1
", where mu
is a
shape parameter of scipy.stats.invgauss
. scipy.stats.wald
has
only two parameters, loc
and scale
. (All the continuous
distributions in scipy.stats
have these parameters.) So the parameters
of numpy.random.wald
and scipy.stats.wald
don't match up:
numpy.random.wald
has a shape and a scale parameter, but scipy.stats.wald
has a location and a scale parameter.
Instead of scipy.stats.wald
, you must use scipy.stats.invgauss
to fit data generated with numpy.random.wald
.
scipy.stats.invgauss
is an implementation of the inverse Gaussian distribution that is
mentioned in the docstring of numpy.random.wald
. scipy.stats.invgauss
has three parameters: one shape parameter called mu
, along with the
standard location (loc
) and scale parameters.
The shape parameter mu
of scipy.stats.invgauss
is not the same as the
shape parameter mean
of numpy.random.wald
. If you do a little algebra with the PDFs of the two functions, you'll find that the relation is
mean = mu * scale
where mu
is the invgauss
shape parameter, mean
is the shape parameter used in numpy.random.wald
, and scale
has the same meaning in both functions.
If you generate a sample using numpy.random.wald
and you then want to
recover the parameters by fitting the inverse Gaussian distribution to
it, you must use the above relation to convert the result of the fit to
the mean
used by numpy.random.wald
. Also, numpy.random.wald
doesn't
have a location parameter, so you must restrict the location of scipy.stats.invgauss
to be 0 by using the argument floc=0
in scipy.stats.invgauss.fit()
.
Here's an example. First, generate some data using numpy.random.wald
:
In [55]: m = 4
In [56]: s = 41
In [57]: some_data = np.random.wald(mean=m, scale=s, size=100000)
Now fit scipy.stats.invgauss
to that data, with the restriction that the
location parameter is 0:
In [58]: from scipy.stats import invgauss
In [59]: mu, loc, scale = invgauss.fit(some_data, floc=0)
In [60]: mu, loc, scale
Out[60]: (0.097186409353576975, 0, 41.155034600558793)
As expected, the scale
parameter is close to the parameter that was used to generate the data. To get the estimate of the shape parameter that was used, multiply mu
and scale
:
In [61]: mu*scale
Out[61]: 3.9997100396505312
It is approximately 4, as expected.
A plot is always useful for visualizing the fit. In the plot, the blue bars show the normalized histogram of the data, and the black curve is the PDF of the fitted inverse Gaussian distribution.
In [86]: import matplotlib.pyplot as plt
In [87]: _ = plt.hist(some_data, bins=40, normed=True, alpha=0.6)
In [88]: xx = np.linspace(some_data.min(), some_data.max(), 500)
In [89]: yy = invgauss.pdf(xx, mu, loc, scale)
In [90]: plt.plot(xx, yy, 'k')
Out[90]: [<matplotlib.lines.Line2D at 0x11b6d64e0>]
Upvotes: 3
Reputation: 21663
I don't know that you can; this appears to be a can of worms. See if you agree with my reasoning.
from numpy.random import wald
import scipy.stats
means = [1, 2, 4, 8]
samples = [wald(mean=mean, scale=1, size=100000) for mean in means]
print(('{:>10d}'*len(means)).format(*means))
stats = [scipy.stats.wald.fit(sample) for sample in samples]
print(('{:>10.2f}'*len(means)).format(*[stat[1] for stat in stats]))
print(('{:>10.2f}'*len(means)).format(*[stat[0] for stat in stats]))
scales = [1, 4, 16, 64]
samples = [wald(mean=1, scale=scale, size=100000) for scale in scales]
print(('{:>10d}'*len(scales)).format(*scales))
stats = [scipy.stats.wald.fit(sample) for sample in samples]
print(('{:>10.2f}'*len(scales)).format(*[stat[1] for stat in stats]))
print(('{:>10.2f}'*len(scales)).format(*[stat[0] for stat in stats]))
First I generate four samples, one for each of the means 1, 2, 4 and 8, keeping the scale the same at 1. I calculate a fit for each sample. Then I generate another four samples, one for each of the scales 1, 4, 16 and 64, this time keeping the mean the same at 1.
Here are the results.
1 2 4 8
1.00 1.90 3.53 6.43
-0.00 -0.13 -0.43 -1.06
1 4 16 64
1.00 1.14 0.92 0.68
0.00 0.12 0.35 0.55
I would expect the location to appear first in each pair of results but it appears that location is second. Still, at least the location does approximate the mean, even if it shows an increasing negative bias. It's difficult to interpret the scale. Over a large range the scale estimates might be on a logarithm scale.
This might be a question to put on the developer's site.
Upvotes: 0