Reputation: 6333
I calculated a 95% confidence interval using scipy and the result is different to what I was expecting.
I am solving a problem where someone rolled a die 20,000 times and observed 3,932 sixes. I am being asked to build a 95% confidence interval for the probability of rolling a six. The number of sixes follows a binomial distribution with 20,000 trials and a success probability estimated by the observed proportion 3,932 / 20,000.
import numpy

# Number of observations
n_obs = 20000
# Observed proportion of successes
p_obs = 3932 / n_obs
# Standard error of the observed proportion
s_obs = numpy.sqrt((p_obs * (1 - p_obs)) / n_obs)
If I use a normal approximation with these parameters, the confidence interval should be p_obs ± 1.96 * s_obs, that is, between 0.1911 and 0.2021.
However, if I do the following, it returns a completely different interval.
import scipy.stats

# Declare the normal random variable
X = scipy.stats.norm(loc=p_obs, scale=s_obs)
# Get the interval
X.interval(alpha=0.05)
> (0.1964, 0.1968) # Different to what I was expecting
Why is this happening? Am I missing something?
Upvotes: 3
Views: 358
Reputation: 6333
This is really counter-intuitive, but it turns out that the alpha parameter of the interval() method is the probability mass of the distribution contained between the edges of the interval, not the significance level.
Hence, the correct way of calculating the 95% confidence interval is:
X.interval(alpha=0.95)
> (0.19109204017782955, 0.20210795982217045)
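As a sanity check (using the same frozen X as above): a 95% interval is just the central 95% of the distribution, so the same bounds fall out of the 2.5% and 97.5% quantiles.
# Equivalent way to get the central 95% interval via quantiles
X.ppf(0.025), X.ppf(0.975)
# -> same bounds as X.interval(alpha=0.95) above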
This goes against the standard nomenclature used in statistics, where alpha usually denotes the significance level, so I raised an issue on GitHub. Apparently the name also causes collisions with other methods.
Upvotes: 2