Reputation: 51
I am running a linear regression in Python to predict the stock distributed at various sites in Ivory Coast. I have data from 2016 through September 2019 (the screenshot of the data and the list of columns are not shown).
I label-encoded the site code: there are 156 different sites, each labeled from 0 to 155. Similarly, I used the pandas get_dummies function to create 11 columns for the 11 product codes.
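For illustration, the two encodings described above could look roughly like this. It is a minimal sketch on a made-up frame: the site and product values are hypothetical, and using `cat.codes` for the label encoding is an assumption (the question does not say which encoder was used).

```python
import pandas as pd

# Hypothetical toy frame: column names follow the question, values are made up.
df = pd.DataFrame({
    "site_code": ["C1001", "C1002", "C1001", "C1003"],
    "product_code": ["AS27000", "AS27132", "AS27132", "AS27000"],
    "stock_distributed": [10, 4, 7, 12],
})

# Label encoding: each distinct site becomes an integer 0..n_sites-1.
# A linear model will treat these integers as an ordered numeric quantity.
df["site_code"] = df["site_code"].astype("category").cat.codes

# One-hot encoding: one 0/1 indicator column per product code.
df_encoded = pd.get_dummies(df, columns=["product_code"], dtype=int)

print(df_encoded.columns.tolist())
```

Because the label-encoded site code enters the regression as a single number, the model fits one slope across all sites; one-hot columns would instead give each site its own coefficient, at the cost of many more columns.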
I then fitted a linear regression to predict the output and, to my surprise, the R-squared value is 100%.
Code:
import statsmodels.api as sm

# Predictors: intercept, numeric columns, label-encoded site code, one-hot dummies
features = ['intercept', 'year', 'month', 'site_code', 'stock_initial',
            'stock_received', 'stock_adjustment', 'stock_end',
            'average_monthly_consumption', 'stock_stockout_days', 'stock_ordered',
            'site_latitude', 'site_longitude',
            'product_code_AS21126', 'product_code_AS27000',
            'product_code_AS27132', 'product_code_AS27133', 'product_code_AS27134',
            'product_code_AS27137', 'product_code_AS27138', 'product_code_AS27139',
            'product_code_AS42018', 'product_code_AS46000',
            'site_type_Health Center',
            'site_type_University Hospital/National Institute']

# Ordinary least squares: first argument is the target, second the predictors
lm = sm.OLS(df_logistics_new_onehot_label['stock_distributed'],
            df_logistics_new_onehot_label[features])
results = lm.fit()
results.summary()
The output of the regression looks like this (screenshot not shown).
I then split the data into training and test sets:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = df_logistics_new_onehot_label[['intercept', 'year', 'month', 'site_code', 'stock_initial',
    'stock_received', 'stock_adjustment', 'stock_end',
    'average_monthly_consumption', 'stock_stockout_days', 'stock_ordered',
    'site_latitude', 'site_longitude',
    'product_code_AS21126', 'product_code_AS27000',
    'product_code_AS27132', 'product_code_AS27133', 'product_code_AS27134',
    'product_code_AS27137', 'product_code_AS27138', 'product_code_AS27139',
    'product_code_AS42018', 'product_code_AS46000',
    'site_type_Health Center',
    'site_type_University Hospital/National Institute']]
y = df_logistics_new_onehot_label['stock_distributed']

# shuffle=False keeps row order, so the last 20% of rows form the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
clf = LinearRegression()
# The model must be fitted on the training split before predict() can be called
clf.fit(X_train, y_train)
clf.predict(X_test)
The predictions from the linear regression on the 20% test data exactly match the stock_distributed variable (screenshot not shown).
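To put a number on how closely held-out predictions match the target, one option is to compute the test R-squared and the largest absolute error. The sketch below is not run on the real data: it uses synthetic stand-in arrays (an assumption) in which `y` is an exact linear function of `X`, so a perfect fit is expected by construction.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): y is an exact linear function of X.
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0

# Same split settings as in the question: last 20% of rows, no shuffling.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

# R^2 on rows the model never saw, plus the worst-case prediction error.
test_r2 = r2_score(y_test, pred)
max_abs_error = np.max(np.abs(pred - y_test))
print(f"test R^2 = {test_r2:.6f}, max abs error = {max_abs_error:.2e}")
```

A near-zero maximum error on unseen rows suggests the target can be computed from the predictors, which is a different situation from a model that fits the training rows well and the test rows poorly.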
Is the model overfitting or am I doing something wrong?
Upvotes: 0
Views: 684
Reputation: 1604
Your target variable is perfectly correlated with the following columns (list not shown):
It also makes sense logically that these are correlated. Remove those columns first and then run the linear regression again.
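As a rough illustration of this advice, the sketch below builds made-up data and compares the fit with and without the suspect columns. Two things are assumptions, not facts from the question: the stock balance `stock_end = stock_initial + stock_received + stock_adjustment - stock_distributed`, and the choice of columns in `leaky`. The column names are taken from the question; the numbers are synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 500

# Synthetic data (assumption): values are made up, column names follow the question.
df = pd.DataFrame({
    "stock_initial": rng.integers(0, 200, n),
    "stock_received": rng.integers(0, 100, n),
    "stock_adjustment": rng.integers(-10, 10, n),
    "average_monthly_consumption": rng.integers(0, 80, n),
})
# ASSUMPTION: distributed stock is a noisy function of consumption ...
df["stock_distributed"] = (
    df["average_monthly_consumption"] + rng.integers(-15, 15, n)
).clip(lower=0)
# ... and the end stock closes the balance exactly (hypothetical identity).
df["stock_end"] = (
    df["stock_initial"] + df["stock_received"]
    + df["stock_adjustment"] - df["stock_distributed"]
)

target = "stock_distributed"

# Pairwise correlation of every column with the target, strongest first.
print(df.corr()[target].drop(target).sort_values(ascending=False))

# Hypothetical list of columns that together determine the target.
leaky = ["stock_initial", "stock_received", "stock_adjustment", "stock_end"]
X_all = df.drop(columns=[target])
X_reduced = X_all.drop(columns=leaky)

r2_all = LinearRegression().fit(X_all, df[target]).score(X_all, df[target])
r2_reduced = LinearRegression().fit(X_reduced, df[target]).score(X_reduced, df[target])
print(f"R^2 with balance columns:    {r2_all:.4f}")
print(f"R^2 without balance columns: {r2_reduced:.4f}")
```

In this toy setup no single balance column needs a correlation of exactly 1 with the target; it is their combination that reproduces it, so it can help to check for exact linear relationships as well as pairwise correlations.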
Upvotes: 3