Reputation: 19805
I have an empirical distribution and I am trying to fit a t distribution to it using numpy/scipy and plot it with matplotlib. Here is something I cannot understand:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import t

arr = np.array([140, 36, 44, 24, 15, 48, 19, 2, 84, 6, 70, 3,
                20, 6, 133, 23, 30, 7, 37, 165])

# Fit a t distribution; returns the shape and (loc, scale) parameters
params = t.fit(arr)

mean = arr.mean()
std = arr.std()
r = np.arange(mean - 3 * std, mean + 3 * std, 0.01)

# Evaluate the fitted pdf over the plotting range
pdf_fitted = t.pdf(r, *params[0:-2], loc=params[-2], scale=params[-1])

plt.plot(r, pdf_fitted)
plt.plot([mean, mean], [0, max(pdf_fitted)])  # vertical line at the sample mean
plt.show()
This plots:
The green line is the mean of the empirical data, and the blue line is the t distribution fitted to the same data.
The problem is that the empirical mean and the peak of the distribution do not match. When I fit a normal distribution to the same data, the green line and the peak of the distribution match perfectly, as expected.
Now, looking at the Wikipedia article on the t distribution:
The t-distribution is symmetric and bell-shaped, like the normal distribution, but has heavier tails...
Since it says it is symmetric I expect that my mean and the peak match perfectly, but it does not.
My question is: is there anything wrong with my Python code, or is this the expected behavior of the t distribution? If it is expected, why? If not, what am I doing wrong in my code?
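For what it's worth, printing the fitted parameters shows the same mismatch numerically (a quick check, not in the original post): `scipy.stats.t.fit` returns `(df, loc, scale)`, and since the t pdf is symmetric about `loc`, the peak of the fitted curve sits at `loc` rather than at the sample mean.

```python
import numpy as np
from scipy.stats import t

arr = np.array([140, 36, 44, 24, 15, 48, 19, 2, 84, 6, 70, 3,
                20, 6, 133, 23, 30, 7, 37, 165])

# t.fit returns (df, loc, scale); the t pdf is symmetric about loc,
# so the peak of the fitted pdf is at loc, not at the sample mean.
df, loc, scale = t.fit(arr)

print(arr.mean())  # sample mean (where the green line is drawn)
print(loc)         # center of the fitted t distribution
```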
Upvotes: 2
Views: 217
Reputation: 648
There's no bug in the Python code as far as I can see; in fact, this is a good example of the robustness of the Student t distribution compared to the Gaussian.

One characteristic of exponential-family distributions (Gaussian, exponential, binomial, Poisson, etc.) is that they have very thin tails, meaning that the pdf decreases exponentially as you move away from the mean. This gives them nice theoretical properties, but it is often a bottleneck when modeling real-world data, where outliers abound. The t distribution is therefore a popular alternative, because a couple of outliers in your observed dataset will not affect your inferences much.

In your example, think of the original dataset as consisting of all points except the three high outliers, and suppose those outliers were introduced by some noisy process. Statistical inference aims to describe properties (say, the mean) of the original dataset, so if you used a Gaussian in this case you would grossly over-estimate the true mean. The t fit will not match the mean of your noisy sample, but it will be a much more accurate estimate of the original true mean, regardless of the outliers.
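To make this concrete, here is a small sketch using the data from the question. The Gaussian MLE for the location is exactly the sample mean, which the three high values drag upward, while `t.fit` places its location parameter near the bulk of the data. The last line also illustrates the heavier tail of a standard t with few degrees of freedom:

```python
import numpy as np
from scipy.stats import norm, t

arr = np.array([140, 36, 44, 24, 15, 48, 19, 2, 84, 6, 70, 3,
                20, 6, 133, 23, 30, 7, 37, 165])

# The Gaussian fit's location is the sample mean, which the
# outliers (133, 140, 165) pull upward.
mu, sigma = norm.fit(arr)

# The t fit down-weights the outliers, so its location parameter
# lands closer to the bulk of the data (near the median).
df, loc, scale = t.fit(arr)

print(mu, np.median(arr), loc)

# Heavier tails: six standard units from the center, a t with 3
# degrees of freedom assigns far more density than a Gaussian.
print(norm.pdf(6), t.pdf(6, 3))
```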
Upvotes: 3