Reputation: 568
scipy.stats.rv_continuous.fit, allows you to fix parameters when fitting a distribution, but it's dependent on scipy's choice of parametrization. For the gamma distribution is uses the k, theta (shape, scale) parametrization, so it would be easy to fit while holding theta constant, for example. I want to fit to a data set where I know the mean, but the observed mean might vary due to sampling error. This would be easy if scipy used the parametrization that uses mu = k*theta instead of theta. Is there a way to make scipy do this? And if not, is there another library that can?
Here's some example code with a data set has an observed mean of 9.952, but I know the actual mean of the underlying distribution is 11:
from scipy.stats import gamma
observations = [17.6, 24.9, 3.9, 17.6, 11.8, 10.4, 4.1, 11.7, 5.7, 1.6,
8.6, 12.9, 5.7, 8.0, 7.4, 1.2, 11.3, 10.4, 1.0, 1.9,
6.0, 9.3, 13.3, 5.4, 9.1, 4.0, 12.8, 11.1, 23.1, 4.2,
7.9, 11.1, 10.0, 3.4, 27.8, 7.2, 14.9, 2.9, 5.5, 7.0,
3.9, 12.3, 10.6, 22.1, 5.0, 4.1, 21.3, 15.9, 34.5, 8.1,
19.6, 10.8, 13.4, 22.8, 27.6, 6.8, 5.9, 9.0, 7.1, 21.2,
1.0, 14.6, 16.9, 1.0, 6.5, 2.9, 7.1, 14.1, 15.2, 7.8,
9.0, 4.9, 2.1, 9.5, 5.6, 11.1, 7.7, 18.3, 3.8, 11.0,
4.2, 12.5, 8.4, 3.2, 4.0, 3.8, 2.0, 24.7, 24.6, 3.4,
4.3, 3.2, 7.6, 8.3, 14.5, 8.3, 8.4, 14.0, 1.0, 9.0]
shape, _, scale = gamma.fit(observations, floc = 0)
print(shape*scale)
and this gives
9.952
but I would like a fit such that shape*scale = 11.0
Upvotes: 1
Views: 1081
Reputation: 114781
The fit
method of the SciPy distributions provides the maximum likelihood estimate of the parameters. You are correct that it only provides for fitting the shape, location and scale. (Actually, you said shape and scale, but SciPy also includes a location parameter. Sometimes this is called the three parameter gamma distribution.)
For most of the distributions in SciPy, the fit
method uses a numerical optimizer to minimize the negative log-likelihood, as defined in the nnlf
method. Instead of using the fit
method, you could do this yourself with just a couple lines of code. This allows you to create an objective function with just one parameter, say the shape k
, and within that function, set theta = mean/k
, where mean
is the desired mean, and call gamma.nnlf
to evaluate the negative log-likelihood. Here's one way you could do it:
import numpy as np
from scipy.stats import gamma
from scipy.optimize import fmin
def nll(k, mean, x):
return gamma.nnlf(np.array([k[0], 0, mean/k[0]]), x)
observations = [17.6, 24.9, 3.9, 17.6, 11.8, 10.4, 4.1, 11.7, 5.7, 1.6,
8.6, 12.9, 5.7, 8.0, 7.4, 1.2, 11.3, 10.4, 1.0, 1.9,
6.0, 9.3, 13.3, 5.4, 9.1, 4.0, 12.8, 11.1, 23.1, 4.2,
7.9, 11.1, 10.0, 3.4, 27.8, 7.2, 14.9, 2.9, 5.5, 7.0,
3.9, 12.3, 10.6, 22.1, 5.0, 4.1, 21.3, 15.9, 34.5, 8.1,
19.6, 10.8, 13.4, 22.8, 27.6, 6.8, 5.9, 9.0, 7.1, 21.2,
1.0, 14.6, 16.9, 1.0, 6.5, 2.9, 7.1, 14.1, 15.2, 7.8,
9.0, 4.9, 2.1, 9.5, 5.6, 11.1, 7.7, 18.3, 3.8, 11.0,
4.2, 12.5, 8.4, 3.2, 4.0, 3.8, 2.0, 24.7, 24.6, 3.4,
4.3, 3.2, 7.6, 8.3, 14.5, 8.3, 8.4, 14.0, 1.0, 9.0]
# This is the desired mean of the distribution.
mean = 11
# Initial guess for the shape parameter.
k0 = 3.0
opt = fmin(nll, k0, args=(mean, np.array(observations)),
xtol=1e-11, disp=False)
k_opt = opt[0]
theta_opt = mean / k_opt
print(f"k_opt: {k_opt:9.7f}")
print(f"theta_opt: {theta_opt:9.7f}")
This script prints
k_opt: 1.9712604
theta_opt: 5.5801861
Alternatively, one can modify the first order conditions for the extremum of the log-likelihood shown in wikipedia so that there is just one parameter, k
. Then the condition for the extreme value can be implemented as a scalar equation whose root can be found with, say, scipy.optimize.fsolve
. The following is a variation of the above script that uses this technique.
import numpy as np
from scipy.special import digamma
from scipy.optimize import fsolve
def first_order_eq(k, mean, x):
mean_logx = np.mean(np.log(x))
return (np.log(k) - digamma(k) + mean_logx - np.mean(x)/mean
- np.log(mean) + 1)
observations = [17.6, 24.9, 3.9, 17.6, 11.8, 10.4, 4.1, 11.7, 5.7, 1.6,
8.6, 12.9, 5.7, 8.0, 7.4, 1.2, 11.3, 10.4, 1.0, 1.9,
6.0, 9.3, 13.3, 5.4, 9.1, 4.0, 12.8, 11.1, 23.1, 4.2,
7.9, 11.1, 10.0, 3.4, 27.8, 7.2, 14.9, 2.9, 5.5, 7.0,
3.9, 12.3, 10.6, 22.1, 5.0, 4.1, 21.3, 15.9, 34.5, 8.1,
19.6, 10.8, 13.4, 22.8, 27.6, 6.8, 5.9, 9.0, 7.1, 21.2,
1.0, 14.6, 16.9, 1.0, 6.5, 2.9, 7.1, 14.1, 15.2, 7.8,
9.0, 4.9, 2.1, 9.5, 5.6, 11.1, 7.7, 18.3, 3.8, 11.0,
4.2, 12.5, 8.4, 3.2, 4.0, 3.8, 2.0, 24.7, 24.6, 3.4,
4.3, 3.2, 7.6, 8.3, 14.5, 8.3, 8.4, 14.0, 1.0, 9.0]
# This is the desired mean of the distribution.
mean = 11
# Initial guess for the shape parameter.
k0 = 3
sol = fsolve(first_order_eq, k0, args=(mean, observations),
xtol=1e-11)
k_opt = sol[0]
theta_opt = mean / k_opt
print(f"k_opt: {k_opt:9.7f}")
print(f"theta_opt: {theta_opt:9.7f}")
Output:
k_opt: 1.9712604
theta_opt: 5.5801861
Upvotes: 4