BumbleTee
BumbleTee

Reputation: 21

Recovering parameters for wald distribution: from numpy to scipy

could someone please help with a questions around the parametrization of scipy distributions and how to transform them?

I basically would like to recover distribution parameters of data that I simulate with numpy...

some_data = np.random.normal(loc=81, scale=7, size=100000)

...by fitting a distribution with scipy

recovered_parms = scipy.stats.norm.fit(some_data)

For the normal distribution, this works. recovered_parms ~= (81,7)

However, for e.g. a wald distribution it does not.

some_data = np.random.wald(mean=4, scale=41, size=100000)

recovered_parms = scipy.stats.wald.fit(some_data)

Result: recovered_parms ~= (1.28,3.66)

I understand that they need to be transformed but just can't figure out how. Any help appreciated.

Upvotes: 1

Views: 839

Answers (3)

Umang Gupta
Umang Gupta

Reputation: 16460

If the problem is to just estimate the lambda and mean of the wald distribution. You can just do

mean = np.mean(some_data)
lambda_ = 1/(np.mean(1/some_data) - 1/mean) # lambda is a reserved keyword :/

This estimate seems to be pretty close than whatever the scipy.stats.wald fit is returning (if we interpret one of them as mean or we know how to interpret it)

Upvotes: 1

Warren Weckesser
Warren Weckesser

Reputation: 114911

numpy.random.wald has two parameters, mean and scale. scale is, as the name suggests, a scale parameter, in the sense of a location-scale family. mean is a shape parameter; it is not a location parameter.

If you look at the docstring for numpy.random.wald, it says "Draw samples from a Wald, or inverse Gaussian, distribution." The docstring for scipy.stats.wald, however, says that it is "a special case of invgauss with mu == 1", where mu is a shape parameter of scipy.stats.invgauss. scipy.stats.wald has only two parameters, loc and scale. (All the continuous distributions in scipy.stats have these parameters.) So the parameters of numpy.random.wald and scipy.stats.wald don't match up: numpy.random.wald has a shape and a scale parameter, but scipy.stats.wald has a location and a scale parameter.

Instead of scipy.stats.wald, you must use scipy.stats.invgauss to fit data generated with numpy.random.wald. scipy.stats.invgauss is an implementation of the inverse Gaussian distribution that is mentioned in the docstring of numpy.random.wald. scipy.stats.invgauss has three parameters: one shape parameter called mu, along with the standard location (loc) and scale parameters.

The shape parameter mu of scipy.stats.invgauss is not the same as the shape parameter mean of numpy.random.wald. If you do a little algebra with the PDFs of the two functions, you'll find that the relation is

mean = mu * scale

where mu is the invgauss shape parameter, mean is the shape parameter used in numpy.random.wald, and scale has the same meaning in both functions.

If you generate a sample using numpy.random.wald and you then want to recover the parameters by fitting the inverse Gaussian distribution to it, you must use the above relation to convert the result of the fit to the mean used by numpy.random.wald. Also, numpy.random.wald doesn't have a location parameter, so you must restrict the location of scipy.stats.invgauss to be 0 by using the argument floc=0 in scipy.stats.invgauss.fit().

Here's an example. First, generate some data using numpy.random.wald:

In [55]: m = 4

In [56]: s = 41

In [57]: some_data = np.random.wald(mean=m, scale=s, size=100000)

Now fit scipy.stats.invgauss to that data, with the restriction that the location parameter is 0:

In [58]: from scipy.stats import invgauss

In [59]: mu, loc, scale = invgauss.fit(some_data, floc=0)

In [60]: mu, loc, scale
Out[60]: (0.097186409353576975, 0, 41.155034600558793)

As expected, the scale parameter is close to the parameter that was used to generate the data. To get the estimate of the shape parameter that was used, multiply mu and scale:

In [61]: mu*scale
Out[61]: 3.9997100396505312

It is approximately 4, as expected.

A plot is always useful for visualizing the fit. In the plot, the blue bars show the normalized histogram of the data, and the black curve is the PDF of the fitted inverse Gaussian distribution.

In [86]: import matplotlib.pyplot as plt

In [87]: _ = plt.hist(some_data, bins=40, normed=True, alpha=0.6)

In [88]: xx = np.linspace(some_data.min(), some_data.max(), 500)

In [89]: yy = invgauss.pdf(xx, mu, loc, scale)

In [90]: plt.plot(xx, yy, 'k')
Out[90]: [<matplotlib.lines.Line2D at 0x11b6d64e0>]

plot

Upvotes: 3

Bill Bell
Bill Bell

Reputation: 21663

I don't know that you can; this appears to be a can of worms. See if you agree with my reasoning.

from numpy.random import wald
import scipy.stats

means = [1, 2, 4, 8]
samples = [wald(mean=mean, scale=1, size=100000) for mean in means]

print(('{:>10d}'*len(means)).format(*means))
stats = [scipy.stats.wald.fit(sample) for sample in samples]
print(('{:>10.2f}'*len(means)).format(*[stat[1] for stat in stats]))
print(('{:>10.2f}'*len(means)).format(*[stat[0] for stat in stats]))

scales = [1, 4, 16, 64]
samples = [wald(mean=1, scale=scale, size=100000) for scale in scales]

print(('{:>10d}'*len(scales)).format(*scales))
stats = [scipy.stats.wald.fit(sample) for sample in samples]
print(('{:>10.2f}'*len(scales)).format(*[stat[1] for stat in stats]))
print(('{:>10.2f}'*len(scales)).format(*[stat[0] for stat in stats]))

First I generate four samples, one for each of the means 1, 2, 4 and 8, keeping the scale the same at 1. I calculate a fit for each sample. Then I generate another four samples, one for each of the scales 1, 4, 16 and 64, this time keeping the mean the same at 1.

Here are the results.

     1         2         4         8
  1.00      1.90      3.53      6.43
 -0.00     -0.13     -0.43     -1.06
     1         4        16        64
  1.00      1.14      0.92      0.68
  0.00      0.12      0.35      0.55

I would expect the location to appear first in each pair of results but it appears that location is second. Still, at least the location does approximate the mean, even if it shows an increasing negative bias. It's difficult to interpret the scale. Over a large range the scale estimates might be on a logarithm scale.

This might be a question to put on the developer's site.

Upvotes: 0

Related Questions