How is the p value calculated for multiple variables in linear regression?

Question

I am wondering how the p-value is calculated for each variable in a multiple linear regression. From several resources I understand that a p-value below 5% indicates the variable is significant for the model. But how is the p-value actually calculated for each variable?

I looked at the statsmodels output using the summary() function, but it only shows the values. I couldn't find any resource explaining how the p-value for each variable in a multiple linear regression is calculated.

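import numpy as np
import statsmodels.api as sm

nsample = 100
x = np.linspace(0, 10, nsample)
X = np.column_stack((x, x**2))       # predictors x1 = x and x2 = x**2
beta = np.array([1, 0.1, 10])        # true coefficients: const, x1, x2
e = np.random.normal(size=nsample)   # standard normal noise
X = sm.add_constant(X)               # prepend the intercept column
y = np.dot(X, beta) + e
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())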

There is no error here; I am looking for intuition on how the p-value is calculated for each variable in a multiple linear regression.

Simon · Accepted Answer

Inferential statistics work by comparison to known distributions. In the case of regression, that distribution is typically the t-distribution.

You'll notice that each variable has an estimated coefficient from which an associated t-statistic is calculated (the coefficient divided by its standard error). x1, for example, has a t-value of -0.278. To get the p-value, we take that t-value, place it on the t-distribution, and calculate the probability of getting a value at least as extreme as the one you observed, assuming the true coefficient is zero. You can gain some intuition for this by noticing that the p-value column is called P>|t|.

An additional wrinkle here is that the exact shape of the t-distribution depends on the degrees of freedom.

So to calculate a p-value, you need 2 pieces of information: the t-statistic and the residual degrees of freedom of your model (97 in your case: 100 observations minus 3 estimated coefficients).
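To get a feel for how much the degrees of freedom matter, here is a small sketch that evaluates the same t-statistic against t-distributions with different degrees of freedom. The t-value of 2.0 and the df values 5 and 30 are made up purely for illustration; only 97 comes from the model above.

```python
from scipy import stats

t_value = 2.0  # hypothetical t-statistic, chosen only for illustration

# Two-sided p-value for the same t-statistic under different residual df.
# sf() is the survival function, i.e. P(T > t); doubling it covers both tails.
p_by_df = {df: 2 * stats.t.sf(abs(t_value), df=df) for df in (5, 30, 97)}

for df, p in p_by_df.items():
    print(f"df={df:>3}: p = {p:.4f}")
```

With fewer degrees of freedom the tails of the t-distribution are heavier, so the same t-value gives a larger p-value.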

Taking x1 as an example, you can calculate the p-value in Python like this:

import scipy.stats
scipy.stats.t.sf(abs(-0.278), df=97)*2

0.78160405761659357

The same is done for each of the other predictors using their respective t-values.
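If you want to see where the t-values themselves come from, the following is a minimal sketch that reproduces the whole chain by hand with NumPy and SciPy, using the same design as in the question. It assumes the classical OLS setup (independent errors with constant variance), which is what the default non-robust statsmodels summary should correspond to. The seed and the labels const, x1, x2 are my own choices, so the numbers will not match the -0.278 above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Same design as in the question: constant, x, x**2
n = 100
x = np.linspace(0, 10, n)
X = np.column_stack((np.ones(n), x, x**2))
beta = np.array([1, 0.1, 10])
y = X @ beta + rng.normal(size=n)

# OLS coefficient estimates via least squares
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residual degrees of freedom: observations minus estimated coefficients
df_resid = n - X.shape[1]

# Estimate of the error variance from the residuals
resid = y - X @ beta_hat
sigma2 = resid @ resid / df_resid

# Standard errors: square roots of the diagonal of sigma^2 * (X'X)^-1
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# t-statistic per coefficient, then the two-sided p-value from the t-distribution
t_values = beta_hat / se
p_values = 2 * stats.t.sf(np.abs(t_values), df=df_resid)

for name, t, p in zip(("const", "x1", "x2"), t_values, p_values):
    print(f"{name:>5}: t = {t:10.3f}, P>|t| = {p:.3f}")
```

Each coefficient gets its own standard error and therefore its own t-value, but all of them are compared against the same t-distribution with df_resid degrees of freedom.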
