Reputation: 439
I'm trying to create some material for introductory statistics for a seminar. The above code computes a 95% confidence interval for estimating the mean, but the result is not the same from the one implemented in Python. Is there something wrong with my math / code? Thanks.
EDIT:
Data was sampled from here
import pandas as pd
import numpy as np
x = np.random.normal(60000,15000,200)
income = pd.DataFrame()
income = pd.DataFrame()
income['Data Scientist'] = x
# Manual Implementation
sample_mean = income['Data Scientist'].mean()
sample_std = income['Data Scientist'].std()
standard_error = sample_std / (np.sqrt(income.shape[0]))
print('Mean',sample_mean)
print('Std',sample_std)
print('Standard Error',standard_error)
print('(',sample_mean-2*standard_error,',',sample_mean+2*standard_error,')')
# Python Library
import scipy.stats as st
se = st.sem(income['Data Scientist'])
a = st.t.interval(0.95, len(income['Data Scientist'])-1, loc=sample_mean, scale=se)
print(a)
print('Standard Error from this code block',se)
Upvotes: 0
Views: 406
Reputation: 11903
You've got 2 errors.
First, you are using 2 for the multiplier for the CI. The more accurate value is 1.96. "2" is just a convenient estimator. That is making your CI generated manually too fat.
Second, you are comparing a normal distribution to the t-distribution. This probably isn't causing more than decimal dust in difference because you have 199 degrees of freedom for the t-dist, which is basically the normal.
Below is the z-score of 1.96 and computation of CI with apples-to-apples comparison to the norm distribution vs. t.
In [45]: st.norm.cdf(1.96)
Out[45]: 0.9750021048517795
In [46]: print('(',sample_mean-1.96*standard_error,',',sample_mean+1.96*standard_error,')')
( 57558.007862202685 , 61510.37559873406 )
In [47]: st.norm.interval(0.95, loc=sample_mean, scale=se)
Out[47]: (57558.044175045005, 61510.33928589174)
Upvotes: 1