bbartling
bbartling

Reputation: 3502

regression model statsmodel python

This is more of a stats question as the code is working fine, but I am learning regression modeling in python. I have some code below with statsmodel to create a simple linear regression model:

import statsmodels.api as sm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

ng = pd.read_csv('C:/Users/ben/ngDataBaseline.csv',  thousands=',', index_col='Date', parse_dates=True)

X = ng['HDD']
y = ng['Therm']

# Note the difference in argument order
model = sm.OLS(y, X).fit()

# Print out the statistics
model.summary()

I get an output like the screen shot below. I am trying judge the goodness of fit, and I know the R^2 is high, but is it possible to find the root mean squared error (RMSE) of the prediction with statsmodel?

I'm also attempting to research if I can estimate the sampling distribution with a confidence interval. If I am interpreting the table correctly for the intercept HDD 5.9309, with standard error 0.220, p value low 0.000, and I think a 97.5% confidence interval the value of HDD (or is it my dependent variable Therm?) will be between 5.489 and 6.373?? Or I think in percentage that could be expressed as ~ +- 0.072%

enter image description here

EDIT included multiple regression table enter image description here

Upvotes: 3

Views: 20002

Answers (1)

Scratch'N'Purr
Scratch'N'Purr

Reputation: 10399

Is it possible to calculate the RMSE with statsmodels? Yes, but you'll have to first generate the predictions with your model and then use the rmse method.

from statsmodels.tools.eval_measures import rmse

# fit your model which you have already done

# now generate predictions
ypred = model.predict(X)

# calc rmse
rmse = rmse(y, ypred)

As for interpreting the results, HDD isn't the intercept. It's your independent variable. The coefficient (e.g. the weight) is 5.9309 with standard error of 0.220. The t-score for this variable is really high suggesting that it is a good predictor, and since it is high, the p-value is very small (close to 0).

The 5.489 and 6.373 values are your confidence bounds for a 95% confidence interval. The bounds are simply calculated based on adding or subtracting the standard error times the t-statistic associated with the 95% confidence interval from the coefficient.

The t-statistic is dependent on your sample size which in your case is 53, so your degrees of freedom is 52. Using a t-table, this means that for df=52 and a confidence level of 95%, the t-statistic is 2.0066. Therefore the bounds can be manually calculated as thus:

lower: 5.9309 - (2.0066 x 0.220) = 5.498
upper: 5.9309 + (2.0066 x 0.220) = 6.372

Of course, there's some precision loss due to rounding but you can see the manual calculation is really close to what's reported in the summary.

Additional response to your comments:

There are several metrics you can use to evaluate the goodness of fit. One of them being the adjusted R-squared statistic. Others are RMSE, F-statistic, or AIC/BIC. It's up to you to decide which metric or metrics to use to evaluate the goodness of fit. For me, I usually use the adjusted R-squared and/or RMSE, though RMSE is more of a relative metric to compare against other models.

Now looking at your model summaries, both of the models fit well, especially the first model given the high adjusted R-squared value. There may be potential improvement with the second model (may try different combinations of the independent variables) but you won't know unless you experiment. Ultimately, there's no right and wrong model. It just comes down to building several models and comparing them to get the best one. I'll also link an article that explains some of the goodness of fit metrics for regression models.

As for confidence intervals, I'll link this SO post since the person who answered the question has code to create the confidence interval. You'll want to look at the predict_mean_ci_low and predict_mean_ci_high that he created in his code. These two variables will give you the confidence intervals at each observation and from there, you can calculate the +/- therms/kWh by subtracting the lower CI from your prediction or subtracting your prediction from the upper CI.

Upvotes: 6

Related Questions