Water_Math

Reputation: 73

Why is statsmodels.api producing R^2 of 1.000?

I'm using statsmodels to do simple and multiple linear regression, and I'm getting bad R^2 values from the summary. The coefficients look to be calculated correctly, but I get an R^2 of 1.000, which is impossible for my data. I graphed it in Excel and I should be getting around 0.93, not 1.

I'm using a mask to filter the data sent into the model and I'm wondering if that could be the issue, but to me the data looks fine. I am fairly new to Python and statsmodels, so maybe I'm missing something here.

import statsmodels.api as sm

for i, df in enumerate(fallwy_xy):   # Iterate through list of dataframes
    if len(df.index) > 0:            # Check if frame is empty or not
        mask3 = (df['fnu'] >= low)   # Keep rows at or above the 'low' variable
        valid3 = df[mask3]
        if len(valid3) > 0:          # Check if there is data in range of mask3
            # Multiple regression: logssc on logfnu and logdischarge
            X = valid3[['logfnu', 'logdischarge']]
            y = valid3[['logssc']]
            estm = sm.OLS(y, X).fit()
            # Simple regression: logssc on logfnu only
            X = valid3[['logfnu']]
            y = valid3[['logssc']]
            ests = sm.OLS(y, X).fit()

Upvotes: 2

Views: 2270

Answers (1)

Water_Math

Reputation: 73

I finally found out what was going on. By default, sm.OLS in statsmodels does not include a constant (intercept) in the regression equation; you have to add it explicitly with

X = sm.add_constant(X)

The constant matters because, without it, statsmodels calculates R-squared differently: it uses the uncentered version, where the total sum of squares is taken around zero instead of around the mean of y. If you do add a constant, R-squared is calculated the way most people calculate it, which is the centered version. Excel does not change the way it calculates R-squared when given a constant or not, which is why the R-squared statsmodels reported with no constant was so different from Excel's. The OLS Regression summary from statsmodels actually points out the calculation method: when it uses the uncentered, no-constant calculation, it shows R-squared (uncentered): where the R-squared normally appears in the summary table. The links below helped me figure this out.

add hasconstant indicator for R-squared and df calculations

Same model coeffs, different R^2 with statsmodels OLS and sci-kit learn linearregression

Warning : Rod Made a Mistake!

Upvotes: 3
