Computing the confidence intervals for the lambda parameter of an exponential distribution in Python

Suppose I have a sample, which I have reason to believe follows an exponential distribution. I want to estimate the distributions parameter (lambda) and some indication of the confidence. Either the confidence interval or the standard error would be fine. Sadly, scipy.stats.expon.fit does not seem to permit that. Here's an example, I will use lambda=1/120=0.008333 for the test data:

""" Generate test data"""
import scipy.stats
test_data = scipy.stats.expon(scale=120).rvs(size=3000)

""" Scipy.stats fit"""
fit = scipy.stats.expon.fit(test_data)
print("\nExponential parameters:", fit, " Specifically lambda: ", 1/fit[1], "\n")

# Exponential parameters: (0.0066790678905608875, 116.8376079908356)  Specifically lambda:  0.008558887991599736

The answer to a similar question about gamma-distributed data suggests using GenericLikelihoodModel from the statsmodels module. While I can confirm that this works nicely for gamma-distributed data, it does not for exponential distributions, because the optimization apparently results in a non-invertible Hessian matrix. This results either from non-finite elements in the Hessian or from np.linalg.eigh producing non-positive eigenvalues for the Hessian. (Source code here; HessianInversionWarning is raised in the fit method of the LikelihoodModel class.).

""" Statsmodel fit"""
from statsmodels.base.model import GenericLikelihoodModel

class Expon(GenericLikelihoodModel):
    nparams = 2
    def loglike(self, params):
        return scipy.stats.expon.logpdf(self.endog, *params).sum()


res = Expon(test_data).fit(start_params=fit)
res.df_model = len(fit)
res.df_resid = len(test_data) - len(fit)
print(res.summary())

#Optimization terminated successfully.
#         Current function value: 5.760785
#         Iterations: 38
#         Function evaluations: 76
#/usr/lib/python3.8/site-packages/statsmodels/tools/numdiff.py:352: RuntimeWarning: invalid value encountered in double_scalars
#  hess[i, j] = (f(*((x + ee[i, :] + ee[j, :],) + args), **kwargs)
#/usr/lib/python3.8/site-packages/statsmodels/base/model.py:547: HessianInversionWarning: Inverting hessian failed, no bse or cov_params available
#  warn('Inverting hessian failed, no bse or cov_params '
#/usr/lib/python3.8/site-packages/scipy/stats/_distn_infrastructure.py:903: RuntimeWarning: invalid value encountered in greater
#  return (a < x) & (x < b)
#/usr/lib/python3.8/site-packages/scipy/stats/_distn_infrastructure.py:903: RuntimeWarning: invalid value encountered in less
#  return (a < x) & (x < b)
#/usr/lib/python3.8/site-packages/scipy/stats/_distn_infrastructure.py:1912: RuntimeWarning: invalid value encountered in less_equal
#  cond2 = cond0 & (x <= _a)
#                                Expon Results                                 
#==============================================================================
#Dep. Variable:                      y   Log-Likelihood:                -17282.
#Model:                          Expon   AIC:                         3.457e+04
#Method:            Maximum Likelihood   BIC:                         3.459e+04
#Date:                Thu, 06 Aug 2020                                         
#Time:                        13:55:24                                         
#No. Observations:                3000                                         
#Df Residuals:                    2998                                         
#Df Model:                           2                                         
#==============================================================================
#                 coef    std err          z      P>|z|      [0.025      0.975]
#------------------------------------------------------------------------------
#par0           0.0067        nan        nan        nan         nan         nan
#par1         116.8376        nan        nan        nan         nan         nan
#==============================================================================

This seems to happen every single time, so it might be something related to exponentially-distributed data.

Is there some other possible approach? Or am I maybe missing something or doing something wrong here?

Edit: It turns out, I was doing something wrong, namely, I wrongly had

test_data = scipy.stats.expon(120).rvs(size=3000)

instead of

test_data = scipy.stats.expon(scale=120).rvs(size=3000)

and was correspondingly looking at the forst element of the fit tuple while I should have been looking at the second.

As a result, the two other options I considered (manually computing the fit and confidence intervals following the standard procedure described on wikipedia) and using scikits.bootstrap as suggested in this answer actually does work and is part of the solution I'll add in a minute and not of the problem.

Upvotes: 1

Answers (2)

josephmure

Reputation: 228

You have already found a solution to your problem, but here is a solution based on OpenTURNS. I think it relies on bootstrapping under the hood.

OpenTURNS requires you to reshape the data so that it is clear we are dealing with 3000 one-dimensional points and not a single 3000-dimensional point.

test_data = test_data.reshape(-1, 1)

The rest is rather straightforward.

import openturns as ot

confidence_level = 0.9

params_ci = ot.ExponentialFactory().buildEstimator(test_data).getParameterDistribution().computeBilateralConfidenceInterval(confidence_level)

lambda_ci = [params_ci.getLowerBound()[0], params_ci.getUpperBound()[0]] 
# the index 0 means we are interested in the CI on lambda

print(lambda_ci)

I got the following output (but it depends on the random seed):

[0.008076302149561718, 0.008688296487447742]

Upvotes: 0

0range

Reputation: 2158

As mentioned in the edited question, part of the problem was that I was looking at the wrong parameter when creating the sample and again in the fit.

What remains is that scipy.stats.expon.fit does not offer the possibility to compute confidence or errors and that using GenericLikelihoodModel from the statsmodels module as suggested here fails because of a malformed Hessian.

There are, however three approaches that do work:

1. Using the simple inference procedure for confidence intervals in exponential data as given in the wikipedia article

""" Maximum likelihood"""
import numpy as np
ML_lambda = 1 / np.mean(test_data)
print("\nML lambda: {0:8f}".format(ML_lambda))

#ML lambda: 0.008558

""" Bias corrected ML"""
ML_BC_lambda = ML_lambda - ML_lambda / (len(test_data) - 1)
print("\nML bias-corrected lambda: {0:8f}".format(ML_BC_lambda))

#ML bias-corrected lambda: 0.008556

Computation of the confidence intervals::

""" Maximum likelihood 95% confidence"""
CI_distance = ML_BC_lambda * 1.96/(len(test_data)**0.5)
print("\nLambda with confidence intervals: {0:8f} +/- {1:8f}".format(ML_BC_lambda, CI_distance))
print("Confidence intervals: ({0:8f}, {1:9f})".format(ML_BC_lambda - CI_distance, ML_BC_lambda + CI_distance))

#Lambda with confidence intervals: 0.008556 +/- 0.000306
#Confidence intervals: (0.008249,  0.008862)

Second option with this: In addition, the confidence interval equation should also be valid for a lambda estimate produced by a different such as the one from scipy.stats.expon.fit. (I thought that the fitting procedure in scipy.stats.expon.fit was more reliable, but it turns out it is actually the same, without the bias correction (see above).)

""" Maximum likelihood 95% confidence based on scipy.stats fit"""
scipy_stats_lambda = 1 / fit[1]
scipy_stats_CI_distance = scipy_stats_lambda * 1.96/(len(test_data)**0.5)
print("\nOr, based on scipy.stats fit:")
print("Lambda with confidence intervals: {0:8f} +/- {1:8f}".format(scipy_stats_lambda, scipy_stats_CI_distance))
print("Confidence intervals: ({0:8f}, {1:9f})".format(scipy_stats_lambda - scipy_stats_CI_distance, 
                                                                scipy_stats_lambda + scipy_stats_CI_distance))

#Or, based on scipy.stats fit:
#Lambda with confidence intervals: 0.008559 +/- 0.000306
#Confidence intervals: (0.008253,  0.008865)

2. Bootstrapping with scikits.bootstrap following the suggestion in this answer This yields an InstabilityWarning: Some values were NaN; results are probably unstable (all values were probably equal), so this should be treated a bit skeptically.

""" Bootstrapping with scikits"""
print("\n")
import scikits.bootstrap as boot
bootstrap_result = boot.ci(test_data, scipy.stats.expon.fit)
print(bootstrap_result)

#tmp/expon_fit_test.py:53: InstabilityWarning: Some values were NaN; results are probably unstable (all values were probably equal)
#  bootstrap_result = boot.ci(test_data, scipy.stats.expon.fit)
#[[6.67906789e-03 1.12615588e+02]
# [6.67906789e-03 1.21127091e+02]]

3. Using rpy2

""" Using r modules with rpy2"""
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
MASS = importr('MASS')
import rpy2.robjects.numpy2ri
rpy2.robjects.numpy2ri.activate()
rpy_fit = MASS.fitdistr(test_data, "exponential")
rpy_estimate = rpy_fit.rx("estimate")[0][0]
rpy_sd = rpy_fit.rx("sd")[0][0]
rpy_lower = rpy_estimate - 2*rpy_sd
rpy_upper = rpy_estimate + 2*rpy_sd
print("\nrpy2 fit: \nLambda={0:8f} +/- {1:8f}, CI: ({2:8f}, {3:8f})".format(rpy_estimate, rpy_sd, rpy_lower, rpy_upper))

#rpy2 fit: 
#Lambda=0.008558 +/- 0.000156, CI: (0.008246, 0.008871)

Upvotes: 1

Computing the confidence intervals for the lambda parameter of an exponential distribution in Python

Answers (2)

Related Questions