Reputation: 417
I am fitting synthetic data to various distributions in scipy, however, I am observing some unexpected results. My data contains negative numbers and I do not get an error when I fit this data to distributions with non negative supports when fixing location and scale. My code is as follows:
import scipy.stats as st
import numpy as np
import pandas as pd
np.random.seed(7)
test_data = pd.Series(0.5 + 0.1*np.sin(np.linspace(0, 3*np.pi, 100)) + 0.5*np.random.normal(0,1,size=100))
print(np.min(test_data))
Which returns:
-0.5900934692403015
Confirming I have generated negative observations. When I fit scipy lognorm, which has a non inclusive non negative support, I get the expected result of an error where the data bounds are violated:
st.lognorm.fit(test_data, floc=0, fscale=1)
---------------------------------------------------------------------------
FitDataError Traceback (most recent call last)
<ipython-input-13-fbeaae8f3c2e> in <module>
----> 1 st.lognorm.fit(test_data, floc=0, fscale=1)
~\Miniconda3\lib\site-packages\scipy\stats\_continuous_distns.py in fit(self, data, *args, **kwds)
5087 data = data - floc
5088 if np.any(data <= 0):
-> 5089 raise FitDataError("lognorm", lower=floc, upper=np.inf)
5090 lndata = np.log(data)
5091
FitDataError: Invalid values in `data`. Maximum likelihood estimation with 'lognorm' requires that 0.0 < x < inf for each x in `data`.
However, with the following distributions, I am able to fit the data, despite the fact that all of these distributions have non negative data bounds (as defined by their scipy documentation) and fixed location and scale.
st.burr.fit(test_data, floc=0, fscale=1)
st.expon.fit(test_data)
st.chi2.fit(test_data, floc=0, fscale=1)
st.invgauss.fit(test_data, floc=0, fscale=1)
st.invgamma.fit(test_data, floc=0, fscale=1)
Which yield:
(4.435119987970436, 0.32475585134451646, 0, 1)
(-0.5900934692403015, 1.1171187649605647)
(1.349414062500001, 0, 1)
(0.6815429687499996, 0, 1)
(2.301074218750003, 0, 1)
Additionally, the distribution expon without any shape parameters is able parameters was able to execute which was surprising. If someone could explain how these distributions are able to fit to the data despite the fact that their support bounds have been violated I would really appreciate it.
I am running numpy 1.19.2 and scipy 1.5.2
Thank you!
Upvotes: 0
Views: 1367
Reputation: 2249
The fact that those fit
didn't throw any error doesn't mean that they have been a good fit or that they can describe your data.
I'm using scipy==1.6.1
.
You can check plotting results
x = np.linspace(test_data.min(), test_data.max(), 100)
Burr: no error, bu cannot describe data <0
burr_pars = sps.burr.fit(test_data, floc=0, fscale=1)
y = sps.burr(*burr_pars).pdf(x)
plt.plot(x, y)
plt.hist(test_data, alpha=.5, density=True);
Expon: no error, but very bad fit
expon_pars = sps.expon.fit(test_data)
y = sps.expon(*expon_pars).pdf(x)
plt.plot(x, y)
plt.hist(test_data, alpha=.5, density=True);
Chi2: no error but very bad fit and cannot describe data <0
chi2_pars = sps.chi2.fit(test_data, floc=0, fscale=1)
y = sps.chi2(*chi2_pars).pdf(x)
plt.plot(x, y)
plt.hist(test_data, alpha=.5, density=True);
Invgauss: error
invgauss_pars = sps.invgauss.fit(test_data, floc=0, fscale=1)
FitDataError: Invalid values in `data`. Maximum likelihood estimation with 'invgauss' requires that 0 < (x - loc)/scale < inf for each x in `data`.
If you don't set loc and scale, works best for x>=0, but given the formula of its PDF there is no reason why it should throw an error for x<0
invgauss_pars = sps.invgauss.fit(test_data)
y = sps.invgauss(*invgauss_pars).pdf(x)
plt.plot(x, y)
plt.hist(test_data, alpha=.5, density=True);
Invgamma: a warning, bad fit and cannot describe for x<0
invagamm_pars = sps.invgamma.fit(test_data, floc=0, fscale=1)
y = sps.invgauss(*invagamm_pars).pdf(x)
plt.plot(x, y)
plt.hist(test_data, alpha=.5, density=True);
RuntimeWarning: invalid value encountered in double_scalars
Lhat = muhat - Shat*mu
From https://github.com/scipy/scipy/blob/v1.6.3/scipy/stats/_continuous_distns.py you see that FitDataError
is called only by beta
, expon
(but if floc is None
then floc = data_min
), gamma
, invgauss
(but only np.any(data - floc < 0)
), lognorm
, pareto
, rayleigh
, uniform
.
For other distributions FitDataError
is not implemented.
Upvotes: 1