Reputation: 137
I'm trying to plot a regression line on a scatterplot, based on my predicted data.
The problem is that I'm supposed to get a single line, but my plot has many lines connecting all points (see picture) https://i.sstatic.net/VF483.png
After predicting CO2 emissions from the other columns, I plot the test set's engine size against the actual CO2 emissions. I then want to draw the regression line of engine size against the predicted values for the test set, but I can't get a single line.
Here is the code:
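import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.model_selection import train_test_split
#import the dataset
df = pd.read_csv('FuelConsumptionCo2.csv')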
cols = ['ENGINESIZE','CYLINDERS','FUELTYPE','FUELCONSUMPTION_CITY','FUELCONSUMPTION_HWY','FUELCONSUMPTION_COMB','CO2EMISSIONS']
#create new dataset with the columns needed
cdf = df[cols]
#dummies for the categorical column FUELTYPE
cdf = pd.get_dummies(cdf, prefix='FUELTYPE')
#the features without the target column
selFeatures = list(cdf.columns.values)
selFeatures.remove('CO2EMISSIONS')
#split the dataset for fitting
X_train, X_test, Y_train, Y_test = train_test_split(cdf[selFeatures], cdf['CO2EMISSIONS'], test_size=0.5)
#regression model
clfregr = linear_model.LinearRegression()
#train the model
clfregr.fit(X_train, Y_train)
#predict the values
train_pred = clfregr.predict(X_train)
test_pred = clfregr.predict(X_test)
#regression line for the predicted in test
plt.scatter(X_test.ENGINESIZE,Y_test, color='gray')
plt.plot(X_test.ENGINESIZE, test_pred, color='red', linewidth=1)
plt.show()
Upvotes: 1
Views: 1000
Reputation: 91
The problem is that you are fitting a multiple linear regression. You would get a single straight line only if ENGINESIZE were the only predictor in the model. Your model uses other features too, so the prediction for a given engine size also depends on them. With 2 independent variables the fitted model is a plane in 3D; with n independent variables it is a hyperplane in (n+1)-dimensional space, and that does not collapse to a single line when you plot it against ENGINESIZE only.
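One way to still get a single line out of a multi-feature model is to vary ENGINESIZE over a sorted grid while holding the other features fixed, for example at their means. A minimal sketch, assuming synthetic stand-in data (the values and coefficients below are made up for illustration, not taken from FuelConsumptionCo2.csv):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the real dataset (hypothetical values).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "ENGINESIZE": rng.uniform(1.0, 6.0, n),
    "CYLINDERS": rng.integers(3, 9, n).astype(float),
    "FUELCONSUMPTION_COMB": rng.uniform(5.0, 20.0, n),
})
y = (40 * X["ENGINESIZE"] + 8 * X["CYLINDERS"]
     + 10 * X["FUELCONSUMPTION_COMB"] + rng.normal(0, 5, n))

# Multiple linear regression on all three features.
model = LinearRegression().fit(X, y)

# Build a grid: every feature fixed at its mean, ENGINESIZE swept from min to max.
grid = pd.DataFrame(np.tile(X.mean().to_numpy(), (50, 1)), columns=X.columns)
grid["ENGINESIZE"] = np.linspace(X["ENGINESIZE"].min(), X["ENGINESIZE"].max(), 50)
line_pred = model.predict(grid)

fig, ax = plt.subplots()
ax.scatter(X["ENGINESIZE"], y, color="gray")
ax.plot(grid["ENGINESIZE"], line_pred, color="red", linewidth=1)
ax.set_xlabel("ENGINESIZE")
ax.set_ylabel("CO2EMISSIONS")
# plt.show()  # uncomment when running interactively
```

The slope of this line is the ENGINESIZE coefficient of the full model, so it can be flatter or steeper than a line fitted on ENGINESIZE alone when the features are correlated.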
Upvotes: 4
Reputation: 31
To get a single line, you can fit a simple regression on ENGINESIZE alone and plot that model:
model = linear_model.LinearRegression()
x_train = np.asanyarray(df[['ENGINESIZE']])
y_train = np.asanyarray(df[['CO2EMISSIONS']])
model.fit(x_train, y_train)
plt.scatter(df['ENGINESIZE'], df["CO2EMISSIONS"], color='blue')
plt.plot(x_train, model.coef_[0][0]*x_train + model.intercept_[0], color='red')
Upvotes: 3
Reputation: 63062
There are 9 independent variables in the data. Therefore, when you plot against just one of them, you end up with many different predictions for each ENGINESIZE value. That is not a plottable function of ENGINESIZE, so when you draw a line it zigzags among these vertically stacked points.
Notice that a scatterplot of the predictions shows many points on one vertical line, corresponding to different values of the other eight independent variables (the ones not plotted on the x-axis):
plt.scatter(X_test.ENGINESIZE, test_pred, color='yellow')
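To check this numerically rather than by eye, you can group the predictions by ENGINESIZE and look at how many there are and how far they spread. This is only a rough sketch on synthetic stand-in data (two hypothetical features with made-up coefficients, not the real CSV):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data: a few discrete engine sizes plus one other feature.
rng = np.random.default_rng(1)
n = 300
X = pd.DataFrame({
    "ENGINESIZE": rng.choice([1.6, 2.0, 2.4, 3.0, 3.5], n),
    "FUELCONSUMPTION_COMB": rng.uniform(5.0, 20.0, n),
})
y = 30 * X["ENGINESIZE"] + 12 * X["FUELCONSUMPTION_COMB"] + rng.normal(0, 5, n)

# Predictions from the two-feature model.
pred = LinearRegression().fit(X, y).predict(X)

# For each engine size: number of predictions and their vertical spread.
per_size = (
    pd.DataFrame({"ENGINESIZE": X["ENGINESIZE"], "pred": pred})
    .groupby("ENGINESIZE")["pred"]
    .agg(["count", "min", "max"])
)
print(per_size)
```

Each engine size maps to a whole range of predicted values, which is why plt.plot has to jump up and down when it connects the points in row order.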
I will say, the sklearn LinearRegression class is quite difficult to use. I used statsmodels instead:
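import statsmodels.formula.api as smf
# fit a simple regression of CO2 emissions on engine size only
# (note: this reuses the name df for a small two-column frame)
df = pd.DataFrame({'x' : X_train.ENGINESIZE, 'y': Y_train})
smod = smf.ols(formula='y ~ x', data=df)
result = smod.fit()
# test points in gray, fitted line in red
plt.scatter(X_test.ENGINESIZE,Y_test, color='gray')
plt.plot(df['x'], result.predict(df), color='red', linewidth=1)
plt.show()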
Then for extra credit
print(result.summary())
Upvotes: 2
Reputation:
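Try extracting the slope (m) and intercept (b) of the regression line from a LinearRegression model (its coef_ and intercept_ attributes). This gives one line only when the model is fitted on ENGINESIZE alone, since the nine-feature model has nine coefficients. Then use
plt.plot(X_test.ENGINESIZE, m*X_test.ENGINESIZE + b, 'r', linewidth=1)
or use seaborn's lmplot or regplot function.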
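For the slope and intercept route, here is a small sketch of how m and b could be pulled out of a single-feature fit. It assumes synthetic stand-in data (the names engine_size and co2 and the numbers are made up, not from the real CSV):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (hypothetical values).
rng = np.random.default_rng(2)
engine_size = rng.uniform(1.0, 6.0, 100)
co2 = 125.0 + 39.0 * engine_size + rng.normal(0, 10, 100)

# sklearn expects a 2-D feature matrix, so reshape the single feature to (n, 1).
X = engine_size.reshape(-1, 1)
model = LinearRegression().fit(X, co2)

# With a 1-D target, coef_ has shape (1,) and intercept_ is a scalar.
m = model.coef_[0]
b = model.intercept_

fig, ax = plt.subplots()
ax.scatter(engine_size, co2, color="gray")
ax.plot(engine_size, m * engine_size + b, "r", linewidth=1)
# plt.show()  # uncomment when running interactively
```

If the target is passed as a 2-D column instead (for example df[['CO2EMISSIONS']]), coef_ has shape (1, 1) and intercept_ has shape (1,), which is why the other answer indexes them as coef_[0][0] and intercept_[0].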
Upvotes: 2