Reputation: 39
I am trying to build an example linear regression model in Python. The aim is to find a linear relationship between two features in my dataset, 'Year' and 'Obesity (%)'. I want to train my model to predict the future trend of obesity in the world. The problem is that my MSE is too high and my R^2 too low. How can I improve my model?
This is the link where I found the data set: Obesity-cleaned.csv
CODE
#Analysis of obesity by country
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
import numpy as np
import sklearn
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
address = 'C:/Users/Andre/Desktop/Python/firstMN/obesity-cleaned.csv'
dt = pd.read_csv(address)
#eliminate superfluous rows with no recorded value
dt.drop(dt[dt['Obesity (%)'] == 'No data'].index, inplace=True)
#keep the numeric part of 'Obesity (%)' and convert the type to float
dt['Obesity (%)'] = dt['Obesity (%)'].str.split().str[0].astype(float)
obMean = dt['Obesity (%)'].mean()
print('%0.3f' % obMean, '\n')
group = dt.groupby('Country')
print(group[['Year', 'Obesity (%)']].mean(), '\n')
dt1 = dt[dt['Sex'] == 'Both sexes']
print(dt1[dt1['Obesity (%)'] == dt1['Obesity (%)'].max()], '\n')
sb.lmplot(x='Year', y='Obesity (%)', data=dt1)
plt.show()
#linear regression predictions
group1 = dt1.groupby('Year')
x = np.arange(1975, 2017)
y = group1['Obesity (%)'].mean().to_numpy()
x1 = x.reshape(-1, 1)
y1 = y.reshape(-1, 1)
lr = LinearRegression(fit_intercept=False)
lr.fit(x1, y1)
plt.plot(x, y)
plt.show()
print('Coefficients: ', lr.coef_)
print("Intercept: ", lr.intercept_ )
y_hat = lr.predict(x1)
print('MSE: ', metrics.mean_squared_error(y1, y_hat))
print('R^2: ', lr.score(x1, y1) )
print('var: ', y1.var())
OUTPUT
Coefficients: [[0.00626604]]
Intercept: 0.0
MSE: 15.09451970012738
R^2: 0.03779706109503678
var: 15.687459567838905
Correlation among years and obesity (%) is: (0.9960492544111168, 1.0885274634054143e-43)
Upvotes: 1
Views: 2831
Reputation: 36
Remove fit_intercept=False from your code. If the true model intercept really is zero, the fitted intercept term will come out approximately zero anyway, making it unnecessary to set fit_intercept to False. You're essentially constraining the model without, to my knowledge, any reason to do so (correct me if I'm wrong).
From the scikit-learn documentation on linear regression:
Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (i.e. data is expected to be centered).
I didn't see anywhere that you centered the data, so your results are flawed. To remedy the situation, simply remove fit_intercept=False, since it is True by default.
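For reference, a minimal sketch of the corrected fit, reusing the x1 and y1 arrays built in the question (those names are assumed from there):
#Refit with the intercept enabled (fit_intercept=True is the default)
from sklearn.linear_model import LinearRegression
from sklearn import metrics
lr = LinearRegression()
lr.fit(x1, y1)
y_hat = lr.predict(x1)
print('Intercept: ', lr.intercept_)
print('MSE: ', metrics.mean_squared_error(y1, y_hat))
print('R^2: ', lr.score(x1, y1))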
Upvotes: 2
Reputation: 60318
Forcing fit_intercept=False is a huge constraint for the model, and you should be sure that you know exactly what you are doing before deciding to use it.
Fitting without an intercept in simple linear regression practically means that, when our single feature X is 0, the response Y should also be 0; here, it means that in "year 0" (whatever that may mean), obesity should also be 0. Given that, the poor results reported are hardly a surprise (ML is not magic, and it certainly requires that we build realistic assumptions into our models).
It's not clear why you decided to do this, but I highly doubt it is what you intended. You should remove this unnecessary constraint from your model.
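A quick sketch of the difference, again assuming the x1 and y1 arrays from the question:
#Fit once with the no-intercept constraint and once without it
from sklearn.linear_model import LinearRegression
no_intercept = LinearRegression(fit_intercept=False).fit(x1, y1)
default_fit = LinearRegression().fit(x1, y1)
#The constrained line is forced through (0, 0), i.e. obesity 0% in "year 0"
print('R^2 without intercept: ', no_intercept.score(x1, y1))
print('R^2 with intercept: ', default_fit.score(x1, y1))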
Upvotes: 4