Apostolis Kennedy

Reputation: 137

Regression Line plot

I'm trying to plot a regression line on a scatterplot, based on my predicted data.

The problem is that I'm supposed to get a single line, but my plot has many lines connecting all points (see picture) https://i.sstatic.net/VF483.png

After predicting CO2 emissions from the other columns, I plot the test engine size against the actual test values (CO2EMISSIONS) as a scatterplot. I then try to draw the regression line from the test engine size against the predicted test values, but I can't get a single line.

Here is the code:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.model_selection import train_test_split

#import the dataset
df = pd.read_csv('FuelConsumptionCo2.csv')
cols = ['ENGINESIZE','CYLINDERS','FUELTYPE','FUELCONSUMPTION_CITY','FUELCONSUMPTION_HWY','FUELCONSUMPTION_COMB','CO2EMISSIONS']

#create new dataset with the columns needed
cdf = df[cols]
#dummies for the categorical column FUELTYPE
cdf = pd.get_dummies(cdf,'FUELTYPE')

#the features without the target column
selFeatures = list(cdf.columns.values)
del selFeatures[5]


#split the dataset for fitting
X_train, X_test, Y_train, Y_test = train_test_split(cdf[selFeatures], cdf['CO2EMISSIONS'], test_size=0.5)

#regression model
clfregr = linear_model.LinearRegression()

#train the model
clfregr.fit(X_train, Y_train)

#predict the values
train_pred = clfregr.predict(X_train)
test_pred = clfregr.predict(X_test)

#regression line for the predicted in test
plt.scatter(X_test.ENGINESIZE,Y_test,  color='gray')
plt.plot(X_test.ENGINESIZE, test_pred, color='red', linewidth=1)
plt.show()

Upvotes: 1

Views: 1000

Answers (4)

Suhas Gumma

Reputation: 91

The problem is that you are doing multiple linear regression. You should expect a straight line only if engine size is the single factor used to predict CO2 emissions, but your model uses other factors too. With 2 independent variables, the fitted model is a plane in 3D. With n independent variables, it is a hyperplane in (n+1)-dimensional space, and plotting the predictions against just one of those variables does not give a single line.
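One way to still get a single line out of a multiple regression is to vary the feature of interest while holding the others fixed. The sketch below is only an illustration on synthetic data: the arrays `engine`, `cylinders`, and `co2` are made-up stand-ins for the real columns, and the helper `draw` is a hypothetical name that takes a matplotlib Axes from the caller.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: two features and a linear target.
engine = rng.uniform(1.0, 6.0, size=200)
cylinders = rng.integers(3, 9, size=200).astype(float)
co2 = 30.0 * engine + 10.0 * cylinders + rng.normal(0.0, 5.0, size=200)

X = np.column_stack([engine, cylinders])
model = LinearRegression().fit(X, co2)

# Predictions for the actual rows depend on both features, so plotted
# against engine size alone they do not fall on one line.
pred = model.predict(X)

# Vary engine size over a sorted grid while pinning the other feature at
# its mean: this slice through the fitted plane is one straight line.
grid = np.linspace(engine.min(), engine.max(), 50)
X_slice = np.column_stack([grid, np.full_like(grid, cylinders.mean())])
line = model.predict(X_slice)


def draw(ax):
    """Draw the data and the slice on a matplotlib Axes supplied by the caller."""
    ax.scatter(engine, co2, color="gray")
    ax.plot(grid, line, color="red", linewidth=1)
```

To view it, create an Axes with `fig, ax = plt.subplots()`, call `draw(ax)`, then `plt.show()`. The slope of the red line equals the ENGINESIZE coefficient of the full model, which is generally not the slope you would get from regressing on engine size alone.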

Upvotes: 4

ehsan jamshidi

Reputation: 31

You can use this code, which fits the model on ENGINESIZE alone, to plot the regression line:

import numpy as np

model = linear_model.LinearRegression()
x_train = np.asanyarray(df[['ENGINESIZE']])
y_train = np.asanyarray(df[['CO2EMISSIONS']])
model.fit(x_train, y_train)


plt.scatter(df['ENGINESIZE'], df["CO2EMISSIONS"], color='blue')
plt.plot(x_train, model.coef_[0][0]*x_train + model.intercept_[0], color='red')


Upvotes: 3

WestCoastProjects

Reputation: 63062

There are 9 independent variables in the data, so when you plot against just one of them you end up with several predictions per ENGINESIZE value. The predictions are therefore not a function of ENGINESIZE alone, and a line drawn through them zigzags among the multiple points that share each x value.


Notice that when we do a scatterplot of the predictions, many of them sit on the same vertical line. They correspond to different values of the other eight independent variables, the ones not plotted on the x-axis:

plt.scatter(X_test.ENGINESIZE, test_pred, color='yellow')
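The same point can be checked numerically rather than visually. The following is a rough sketch on made-up data, not the real dataset: the frame `demo` and its `pred` column are hypothetical, with predictions that depend on a second feature as they would in a multiple regression.

```python
import numpy as np
import pandas as pd

# Hypothetical predictions: engine size takes a few discrete values, and
# the prediction also depends on a second feature.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "ENGINESIZE": rng.choice([1.6, 2.0, 2.4, 3.5], size=40),
    "CYLINDERS": rng.choice([4, 6, 8], size=40),
})
demo["pred"] = 30.0 * demo["ENGINESIZE"] + 10.0 * demo["CYLINDERS"]

# Number of distinct predicted values at each engine size. Any count
# above 1 means pred is not a function of ENGINESIZE alone.
distinct = demo.groupby("ENGINESIZE")["pred"].nunique()
print(distinct)

# Sorting by x removes the back-and-forth jumps of plt.plot, but the
# vertical jumps at each shared x value remain.
ordered = demo.sort_values("ENGINESIZE")
```

With the real data, the analogous check would group `test_pred` by `X_test.ENGINESIZE`.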


I will say that I find the sklearn LinearRegression class quite difficult to use, so I used statsmodels instead, regressing on ENGINESIZE alone:

import statsmodels.formula.api as smf

plt.scatter(X_test.ENGINESIZE, Y_test, color='gray')
# simple regression of CO2 emissions on engine size only
train_df = pd.DataFrame({'x': X_train.ENGINESIZE, 'y': Y_train})
smod = smf.ols(formula='y ~ x', data=train_df)
result = smod.fit()
plt.plot(train_df['x'], result.predict(train_df), color='red', linewidth=1)
plt.show()


Then for extra credit

print(result.summary())


Upvotes: 2

user12460726

Reputation:

Try extracting the slope (m) and intercept (b) of the regression line from your LinearRegression() model (there is a single slope only when the model is fit on ENGINESIZE alone) and then use

plt.plot(X_test.ENGINESIZE, m*X_test.ENGINESIZE + b, 'r', linewidth=1)

or use seaborn's lmplot or regplot function.
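As a minimal sketch of the slope-and-intercept route, assuming a model fit on engine size alone: the arrays below are synthetic stand-ins for ENGINESIZE and CO2EMISSIONS, and the names `simple`, `x_line`, and `y_line` are chosen here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for ENGINESIZE and CO2EMISSIONS.
rng = np.random.default_rng(2)
engine = rng.uniform(1.0, 6.0, size=100)
co2 = 40.0 * engine + 120.0 + rng.normal(0.0, 8.0, size=100)

# scikit-learn expects a 2-D feature array, hence the reshape.
simple = LinearRegression().fit(engine.reshape(-1, 1), co2)
m = simple.coef_[0]      # slope
b = simple.intercept_    # intercept (a scalar because co2 is 1-D)

# Two endpoints are enough to draw a straight line.
x_line = np.array([engine.min(), engine.max()])
y_line = m * x_line + b
```

Passing `x_line` and `y_line` to `plt.plot` draws the line over a scatter of the data. Note that a model trained on all features has one coefficient per feature, so `m` has to come from a single-feature fit like this one.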

Upvotes: 2
