njalex22
njalex22

Reputation: 417

Scipy rv_continuous fit does not check input data bounds

I am fitting synthetic data to various distributions in scipy, however, I am observing some unexpected results. My data contains negative numbers and I do not get an error when I fit this data to distributions with non negative supports when fixing location and scale. My code is as follows:

import scipy.stats as st
import numpy as np
import pandas as pd

np.random.seed(7)
test_data = pd.Series(0.5 + 0.1*np.sin(np.linspace(0, 3*np.pi, 100)) + 0.5*np.random.normal(0,1,size=100))
print(np.min(test_data))

Which returns:

-0.5900934692403015

Confirming I have generated negative observations. When I fit scipy lognorm, which has a non inclusive non negative support, I get the expected result of an error where the data bounds are violated:

st.lognorm.fit(test_data, floc=0, fscale=1)
---------------------------------------------------------------------------
FitDataError                              Traceback (most recent call last)
<ipython-input-13-fbeaae8f3c2e> in <module>
----> 1 st.lognorm.fit(test_data, floc=0, fscale=1)

~\Miniconda3\lib\site-packages\scipy\stats\_continuous_distns.py in fit(self, data, *args, **kwds)
   5087             data = data - floc
   5088         if np.any(data <= 0):
-> 5089             raise FitDataError("lognorm", lower=floc, upper=np.inf)
   5090         lndata = np.log(data)
   5091 

FitDataError: Invalid values in `data`.  Maximum likelihood estimation with 'lognorm' requires that 0.0 < x < inf for each x in `data`.

However, with the following distributions, I am able to fit the data, despite the fact that all of these distributions have non negative data bounds (as defined by their scipy documentation) and fixed location and scale.

st.burr.fit(test_data, floc=0, fscale=1)

st.expon.fit(test_data)

st.chi2.fit(test_data, floc=0, fscale=1)

st.invgauss.fit(test_data, floc=0, fscale=1)

st.invgamma.fit(test_data, floc=0, fscale=1)

Which yield:

(4.435119987970436, 0.32475585134451646, 0, 1)
(-0.5900934692403015, 1.1171187649605647)
(1.349414062500001, 0, 1)
(0.6815429687499996, 0, 1)
(2.301074218750003, 0, 1)

Additionally, the distribution expon without any shape parameters is able parameters was able to execute which was surprising. If someone could explain how these distributions are able to fit to the data despite the fact that their support bounds have been violated I would really appreciate it.

I am running numpy 1.19.2 and scipy 1.5.2

Thank you!

Upvotes: 0

Views: 1367

Answers (1)

Max Pierini
Max Pierini

Reputation: 2249

The fact that those fit didn't throw any error doesn't mean that they have been a good fit or that they can describe your data.

I'm using scipy==1.6.1.

You can check plotting results

x = np.linspace(test_data.min(), test_data.max(), 100)

Burr: no error, bu cannot describe data <0

burr_pars = sps.burr.fit(test_data, floc=0, fscale=1)
y = sps.burr(*burr_pars).pdf(x)
plt.plot(x, y)
plt.hist(test_data, alpha=.5, density=True);

enter image description here

Expon: no error, but very bad fit

expon_pars = sps.expon.fit(test_data)
y = sps.expon(*expon_pars).pdf(x)
plt.plot(x, y)
plt.hist(test_data, alpha=.5, density=True);

enter image description here

Chi2: no error but very bad fit and cannot describe data <0

chi2_pars = sps.chi2.fit(test_data, floc=0, fscale=1)
y = sps.chi2(*chi2_pars).pdf(x)
plt.plot(x, y)
plt.hist(test_data, alpha=.5, density=True);

enter image description here

Invgauss: error

invgauss_pars = sps.invgauss.fit(test_data, floc=0, fscale=1)
FitDataError: Invalid values in `data`.  Maximum likelihood estimation with 'invgauss' requires that 0 < (x - loc)/scale  < inf for each x in `data`.

If you don't set loc and scale, works best for x>=0, but given the formula of its PDF there is no reason why it should throw an error for x<0

invgauss_pars = sps.invgauss.fit(test_data)
y = sps.invgauss(*invgauss_pars).pdf(x)
plt.plot(x, y)
plt.hist(test_data, alpha=.5, density=True);

enter image description here

Invgamma: a warning, bad fit and cannot describe for x<0

invagamm_pars = sps.invgamma.fit(test_data, floc=0, fscale=1)
y = sps.invgauss(*invagamm_pars).pdf(x)
plt.plot(x, y)
plt.hist(test_data, alpha=.5, density=True);
RuntimeWarning: invalid value encountered in double_scalars
  Lhat = muhat - Shat*mu

enter image description here

EDIT

From https://github.com/scipy/scipy/blob/v1.6.3/scipy/stats/_continuous_distns.py you see that FitDataError is called only by beta, expon (but if floc is None then floc = data_min), gamma, invgauss (but only np.any(data - floc < 0)), lognorm, pareto, rayleigh, uniform.

For other distributions FitDataError is not implemented.

Upvotes: 1

Related Questions