Rithwik Sarma

Reputation: 51

Linear regression R-squared is 1.0

I am performing a linear regression in Python to predict the stock distributed at various sites in Ivory Coast. I have data from 2016 to September 2019, which looks like this: [image: DATA]. The columns are: [image: Dataset_info]. I label-encoded the site code: there are 156 different sites, labeled 0 to 155. Similarly, I used the get_dummies function to create 11 columns for the 11 product codes. I then fitted a linear regression to predict the output and, to my surprise, the R-squared value is 100%. Code:


import statsmodels.api as sm

# Ordinary least squares: stock_distributed regressed on all listed columns
lm = sm.OLS(
    df_logistics_new_onehot_label['stock_distributed'],
    df_logistics_new_onehot_label[['intercept', 'year', 'month', 'site_code', 'stock_initial',
       'stock_received', 'stock_adjustment', 'stock_end',
       'average_monthly_consumption', 'stock_stockout_days', 'stock_ordered',
       'site_latitude', 'site_longitude',
       'product_code_AS21126', 'product_code_AS27000',
       'product_code_AS27132', 'product_code_AS27133', 'product_code_AS27134',
       'product_code_AS27137', 'product_code_AS27138', 'product_code_AS27139',
       'product_code_AS42018', 'product_code_AS46000',
       'site_type_Health Center',
       'site_type_University Hospital/National Institute']])

results = lm.fit()
results.summary()

The output of the regression looks like this: [image: Regression Output]

I then split the data into training and test sets:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = df_logistics_new_onehot_label[['intercept', 'year', 'month', 'site_code', 'stock_initial',
       'stock_received', 'stock_adjustment', 'stock_end',
       'average_monthly_consumption', 'stock_stockout_days', 'stock_ordered',
       'site_latitude', 'site_longitude',
       'product_code_AS21126', 'product_code_AS27000',
       'product_code_AS27132', 'product_code_AS27133', 'product_code_AS27134',
       'product_code_AS27137', 'product_code_AS27138', 'product_code_AS27139',
       'product_code_AS42018', 'product_code_AS46000',
       'site_type_Health Center',
       'site_type_University Hospital/National Institute']]
y = df_logistics_new_onehot_label['stock_distributed']

# Keep row order (no shuffling): the last 20% of rows become the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

clf = LinearRegression()
# The model must be fitted on the training data before it can predict
clf.fit(X_train, y_train)

clf.predict(X_test)

The output of the linear regression on the 20% test data matches the "stock distributed" variable exactly, as you can see here: [image: Output V/s data]. Is the model overfitting, or am I doing something wrong?

Upvotes: 0

Views: 684

Answers (1)

drops

Reputation: 1604

Your target variable can be predicted perfectly from the following columns:

  • 'stock_initial'
  • 'stock_received'
  • 'stock_adjustment'
  • 'stock_end'

This also makes sense logically: the stock distributed in a period is tied to the initial, received, adjusted and end stock for that period. Remove these columns first, then run the linear regression again.
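To illustrate the point, here is a minimal sketch on synthetic data. It assumes the bookkeeping identity stock_distributed = stock_initial + stock_received + stock_adjustment - stock_end, which is a guess about how the dataset was built and not something confirmed in the question. The r_squared helper, the random numbers and the choice of 'month' as the only remaining predictor are all hypothetical. You can run the same identity check on your own DataFrame columns.

```python
import numpy as np


def r_squared(X, y):
    """R-squared of an ordinary least-squares fit of y on X plus an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in for the real data (all values are made up).
rng = np.random.default_rng(0)
n = 500
stock_initial = rng.integers(50, 200, n).astype(float)
stock_received = rng.integers(0, 100, n).astype(float)
stock_adjustment = rng.integers(-10, 10, n).astype(float)
month = rng.integers(1, 13, n).astype(float)

# Assumed identity: whatever is not left at the end was distributed.
available = stock_initial + stock_received + stock_adjustment
stock_distributed = np.floor(available * rng.uniform(0, 1, n))
stock_end = available - stock_distributed

# Diagnostic: does the target follow exactly from the four stock columns?
identity_holds = np.allclose(
    stock_initial + stock_received + stock_adjustment - stock_end,
    stock_distributed,
)

# With the four stock columns the fit is exact; without them it is not.
X_leaky = np.column_stack(
    [month, stock_initial, stock_received, stock_adjustment, stock_end]
)
X_clean = np.column_stack([month])

r2_leaky = r_squared(X_leaky, stock_distributed)
r2_clean = r_squared(X_clean, stock_distributed)

print("identity holds:", identity_holds)
print("R-squared with stock columns:", r2_leaky)
print("R-squared without stock columns:", r2_clean)
```

In this sketch the only remaining predictor is pure noise, so the second R-squared falls to roughly zero. On the real data it will depend on how informative the remaining columns are.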

Upvotes: 3
