Reputation: 73
I'm using statsmodel to do simple and multiple linear regression and I'm getting bad R^2 values from the summary. The coefficients look to be calculated correctly, but I get an R^2 of 1.000 which is impossible for my data. I graphed it in excel and I should be getting around 0.93, not 1.
I'm using a mask to filter data to send into the model and I'm wondering if that could be the issue, but to me the data looks fine. I am fairly new to python and statsmodel so maybe I'm missing something here.
import statsmodels.api as sm
for i, df in enumerate(fallwy_xy): # Iterate through list of dataframes
if len(df.index) > 0: # Check if frame is empty or not
mask3 = (df['fnu'] >= low) # Mask data below 'low' variable
valid3 = df[mask3]
if len(valid3) > 0: # Check if there is data in range of mask3
X = valid3[['logfnu', 'logdischarge']]
y = valid3[['logssc']]
estm = sm.OLS(y, X).fit()
X = valid3[['logfnu']]
y = valid3[['logssc']]
ests = sm.OLS(y, X).fit()
Upvotes: 2
Views: 2270
Reputation: 73
I finally found out what was going on. Statsmodels by default does not incorporate a constant into its OLS regression equation, you have to call it out specifically with
X = sm.add_constant(X)
The reason the constant is so important is because without it, Statsmodels calculates R-squared differently, uncentered to be exact. If you do add a constant then the R-squared gets calculated the way most people calculate R-squared which is the centered version. Excel does not change the way it calculates R-squared when given a constant or not which is why when Statsmodels reported it's R-squared with no constant it as so different from Excel. The OLS Regression summary from Statsmodels actually points out the calculation method if it uses the uncentered no-constant, calculation by showing R-squared (uncentered): where the R-squared shows up in the summary table. The below links helped me figure this out.
add hasconstant indicator for R-squared and df calculations
Same model coeffs, different R^2 with statsmodels OLS and sci-kit learn linearregression
Upvotes: 3