Reputation: 2158
Suppose I have a sample, which I have reason to believe follows an exponential distribution. I want to estimate the distributions parameter (lambda) and some indication of the confidence. Either the confidence interval or the standard error would be fine. Sadly, scipy.stats.expon.fit
does not seem to permit that. Here's an example, I will use lambda=1/120=0.008333 for the test data:
""" Generate test data"""
import scipy.stats
test_data = scipy.stats.expon(scale=120).rvs(size=3000)
""" Scipy.stats fit"""
fit = scipy.stats.expon.fit(test_data)
print("\nExponential parameters:", fit, " Specifically lambda: ", 1/fit[1], "\n")
# Exponential parameters: (0.0066790678905608875, 116.8376079908356) Specifically lambda: 0.008558887991599736
The answer to a similar question about gamma-distributed data suggests using GenericLikelihoodModel
from the statsmodels
module. While I can confirm that this works nicely for gamma-distributed data, it does not for exponential distributions, because the optimization apparently results in a non-invertible Hessian matrix. This results either from non-finite elements in the Hessian or from np.linalg.eigh
producing non-positive eigenvalues for the Hessian. (Source code here; HessianInversionWarning is raised in the fit
method of the LikelihoodModel
class.).
""" Statsmodel fit"""
from statsmodels.base.model import GenericLikelihoodModel
class Expon(GenericLikelihoodModel):
nparams = 2
def loglike(self, params):
return scipy.stats.expon.logpdf(self.endog, *params).sum()
res = Expon(test_data).fit(start_params=fit)
res.df_model = len(fit)
res.df_resid = len(test_data) - len(fit)
print(res.summary())
#Optimization terminated successfully.
# Current function value: 5.760785
# Iterations: 38
# Function evaluations: 76
#/usr/lib/python3.8/site-packages/statsmodels/tools/numdiff.py:352: RuntimeWarning: invalid value encountered in double_scalars
# hess[i, j] = (f(*((x + ee[i, :] + ee[j, :],) + args), **kwargs)
#/usr/lib/python3.8/site-packages/statsmodels/base/model.py:547: HessianInversionWarning: Inverting hessian failed, no bse or cov_params available
# warn('Inverting hessian failed, no bse or cov_params '
#/usr/lib/python3.8/site-packages/scipy/stats/_distn_infrastructure.py:903: RuntimeWarning: invalid value encountered in greater
# return (a < x) & (x < b)
#/usr/lib/python3.8/site-packages/scipy/stats/_distn_infrastructure.py:903: RuntimeWarning: invalid value encountered in less
# return (a < x) & (x < b)
#/usr/lib/python3.8/site-packages/scipy/stats/_distn_infrastructure.py:1912: RuntimeWarning: invalid value encountered in less_equal
# cond2 = cond0 & (x <= _a)
# Expon Results
#==============================================================================
#Dep. Variable: y Log-Likelihood: -17282.
#Model: Expon AIC: 3.457e+04
#Method: Maximum Likelihood BIC: 3.459e+04
#Date: Thu, 06 Aug 2020
#Time: 13:55:24
#No. Observations: 3000
#Df Residuals: 2998
#Df Model: 2
#==============================================================================
# coef std err z P>|z| [0.025 0.975]
#------------------------------------------------------------------------------
#par0 0.0067 nan nan nan nan nan
#par1 116.8376 nan nan nan nan nan
#==============================================================================
This seems to happen every single time, so it might be something related to exponentially-distributed data.
Is there some other possible approach? Or am I maybe missing something or doing something wrong here?
Edit: It turns out, I was doing something wrong, namely, I wrongly had
test_data = scipy.stats.expon(120).rvs(size=3000)
instead of
test_data = scipy.stats.expon(scale=120).rvs(size=3000)
and was correspondingly looking at the forst element of the fit tuple while I should have been looking at the second.
As a result, the two other options I considered (manually computing the fit and confidence intervals following the standard procedure described on wikipedia) and using scikits.bootstrap
as suggested in this answer actually does work and is part of the solution I'll add in a minute and not of the problem.
Upvotes: 1
Views: 1845
Reputation: 228
You have already found a solution to your problem, but here is a solution based on OpenTURNS. I think it relies on bootstrapping under the hood.
OpenTURNS requires you to reshape the data so that it is clear we are dealing with 3000 one-dimensional points and not a single 3000-dimensional point.
test_data = test_data.reshape(-1, 1)
The rest is rather straightforward.
import openturns as ot
confidence_level = 0.9
params_ci = ot.ExponentialFactory().buildEstimator(test_data).getParameterDistribution().computeBilateralConfidenceInterval(confidence_level)
lambda_ci = [params_ci.getLowerBound()[0], params_ci.getUpperBound()[0]]
# the index 0 means we are interested in the CI on lambda
print(lambda_ci)
I got the following output (but it depends on the random seed):
[0.008076302149561718, 0.008688296487447742]
Upvotes: 0
Reputation: 2158
As mentioned in the edited question, part of the problem was that I was looking at the wrong parameter when creating the sample and again in the fit.
What remains is that scipy.stats.expon.fit
does not offer the possibility to compute confidence or errors and that using GenericLikelihoodModel
from the statsmodels
module as suggested here fails because of a malformed Hessian.
There are, however three approaches that do work:
1. Using the simple inference procedure for confidence intervals in exponential data as given in the wikipedia article
""" Maximum likelihood"""
import numpy as np
ML_lambda = 1 / np.mean(test_data)
print("\nML lambda: {0:8f}".format(ML_lambda))
#ML lambda: 0.008558
""" Bias corrected ML"""
ML_BC_lambda = ML_lambda - ML_lambda / (len(test_data) - 1)
print("\nML bias-corrected lambda: {0:8f}".format(ML_BC_lambda))
#ML bias-corrected lambda: 0.008556
Computation of the confidence intervals::
""" Maximum likelihood 95% confidence"""
CI_distance = ML_BC_lambda * 1.96/(len(test_data)**0.5)
print("\nLambda with confidence intervals: {0:8f} +/- {1:8f}".format(ML_BC_lambda, CI_distance))
print("Confidence intervals: ({0:8f}, {1:9f})".format(ML_BC_lambda - CI_distance, ML_BC_lambda + CI_distance))
#Lambda with confidence intervals: 0.008556 +/- 0.000306
#Confidence intervals: (0.008249, 0.008862)
Second option with this: In addition, the confidence interval equation should also be valid for a lambda estimate produced by a different such as the one from scipy.stats.expon.fit
. (I thought that the fitting procedure in scipy.stats.expon.fit
was more reliable, but it turns out it is actually the same, without the bias correction (see above).)
""" Maximum likelihood 95% confidence based on scipy.stats fit"""
scipy_stats_lambda = 1 / fit[1]
scipy_stats_CI_distance = scipy_stats_lambda * 1.96/(len(test_data)**0.5)
print("\nOr, based on scipy.stats fit:")
print("Lambda with confidence intervals: {0:8f} +/- {1:8f}".format(scipy_stats_lambda, scipy_stats_CI_distance))
print("Confidence intervals: ({0:8f}, {1:9f})".format(scipy_stats_lambda - scipy_stats_CI_distance,
scipy_stats_lambda + scipy_stats_CI_distance))
#Or, based on scipy.stats fit:
#Lambda with confidence intervals: 0.008559 +/- 0.000306
#Confidence intervals: (0.008253, 0.008865)
2. Bootstrapping with scikits.bootstrap
following the suggestion in this answer
This yields an InstabilityWarning: Some values were NaN; results are probably unstable (all values were probably equal)
, so this should be treated a bit skeptically.
""" Bootstrapping with scikits"""
print("\n")
import scikits.bootstrap as boot
bootstrap_result = boot.ci(test_data, scipy.stats.expon.fit)
print(bootstrap_result)
#tmp/expon_fit_test.py:53: InstabilityWarning: Some values were NaN; results are probably unstable (all values were probably equal)
# bootstrap_result = boot.ci(test_data, scipy.stats.expon.fit)
#[[6.67906789e-03 1.12615588e+02]
# [6.67906789e-03 1.21127091e+02]]
3. Using rpy2
""" Using r modules with rpy2"""
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
MASS = importr('MASS')
import rpy2.robjects.numpy2ri
rpy2.robjects.numpy2ri.activate()
rpy_fit = MASS.fitdistr(test_data, "exponential")
rpy_estimate = rpy_fit.rx("estimate")[0][0]
rpy_sd = rpy_fit.rx("sd")[0][0]
rpy_lower = rpy_estimate - 2*rpy_sd
rpy_upper = rpy_estimate + 2*rpy_sd
print("\nrpy2 fit: \nLambda={0:8f} +/- {1:8f}, CI: ({2:8f}, {3:8f})".format(rpy_estimate, rpy_sd, rpy_lower, rpy_upper))
#rpy2 fit:
#Lambda=0.008558 +/- 0.000156, CI: (0.008246, 0.008871)
Upvotes: 1