Reputation: 39
I am trying to build an example linear regression model in Python. The aim is to find a linear relationship between two features in my dataset, 'Year' and 'Obesity (%)'. I want to train my model to predict the future trend of obesity in the world. The problem is that my MSE is too high and my R^2 too low. How can I improve my model?
This is the link where I found the data set: Obesity-cleaned.csv
CODE
#Analysis of obesity by country
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
import numpy as np
import sklearn
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
address = 'C:/Users/Andre/Desktop/Python/firstMN/obesity-cleaned.csv'
dt = pd.read_csv(address)
#eliminate superfluous rows with no recorded value
dt.drop(dt[dt['Obesity (%)'] == 'No data'].index, inplace=True)
#keep the numeric part of 'Obesity (%)' and convert the type to float
dt['Obesity (%)'] = dt['Obesity (%)'].str.split().str[0].astype(float)
obMean = dt['Obesity (%)'].mean()
print('%0.3f' % obMean, '\n')
group = dt.groupby('Country')
print(group[['Year', 'Obesity (%)']].mean(), '\n')
dt1 = dt[dt['Sex'] == 'Both sexes']
print(dt1[dt1['Obesity (%)'] == dt1['Obesity (%)'].max()], '\n')
sb.lmplot(x='Year', y='Obesity (%)', data=dt1)
plt.show()
#linear regression predictions
group1 = dt1.groupby('Year')
x = np.arange(1975, 2017)
y = group1['Obesity (%)'].mean().to_numpy()
x1 = x.reshape(-1, 1)
y1 = y.reshape(-1, 1)
lr = LinearRegression(fit_intercept=False)
lr.fit(x1, y1)
plt.plot(x, y)
plt.show()
print('Coefficients: ', lr.coef_)
print("Intercept: ", lr.intercept_ )
y_hat = lr.predict(x1)
print('MSE: ', metrics.mean_squared_error(y1, y_hat))
print('R^2: ', lr.score(x1, y1) )
print('var: ', y1.var())
OUTPUT
Coefficients: [[0.00626604]]
Intercept: 0.0
MSE: 15.09451970012738
R^2: 0.03779706109503678
var: 15.687459567838905
Correlation among years and obesity (%) is: (0.9960492544111168, 1.0885274634054143e-43)
Upvotes: 1
Views: 2831
Reputation: 36
Remove fit_intercept=False from your code. If the true model intercept really is zero, the fitted intercept term will come out approximately zero anyway, making it unnecessary to set fit_intercept to False. You're essentially constraining the model without, to my knowledge, any reason to do so (correct me if I'm wrong).
From the scikit-learn documentation on linear regression:
Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (i.e. data is expected to be centered).
I didn't see anywhere that you centered the data, so your results are flawed. To remedy the situation, simply remove fit_intercept=False, since it is True by default.
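For reference, a minimal sketch of the corrected fit, reusing the x1 and y1 arrays built in the question (those names are assumed from there):
#Refit with the intercept enabled (fit_intercept=True is the default)
from sklearn.linear_model import LinearRegression
from sklearn import metrics
lr = LinearRegression()
lr.fit(x1, y1)
y_hat = lr.predict(x1)
print('Intercept: ', lr.intercept_)
print('MSE: ', metrics.mean_squared_error(y1, y_hat))
print('R^2: ', lr.score(x1, y1))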
Upvotes: 2
Reputation: 60318
Forcing fit_intercept=False is a huge constraint for the model, and you should be sure that you know exactly what you are doing before deciding to use it.
Fitting without an intercept in simple linear regression practically means that, when our single feature X is 0, the response Y should also be 0; here, it means that in "year 0" (whatever that may mean), obesity should also be 0. Given that, the poor results reported are hardly a surprise (ML is not magic, and it certainly requires that we build realistic assumptions into our models).
It's not clear why you decided to do this, but I highly doubt it is what you intended. You should remove this unnecessary constraint from your model.
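A quick sketch of the difference, again assuming the x1 and y1 arrays from the question:
#Fit once with the no-intercept constraint and once without it
from sklearn.linear_model import LinearRegression
no_intercept = LinearRegression(fit_intercept=False).fit(x1, y1)
default_fit = LinearRegression().fit(x1, y1)
#The constrained line is forced through (0, 0), i.e. obesity 0% in "year 0"
print('R^2 without intercept: ', no_intercept.score(x1, y1))
print('R^2 with intercept: ', default_fit.score(x1, y1))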
Upvotes: 4