Reputation: 47
I am trying the following code:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
from sklearn.linear_model import LogisticRegression
from sklearn import linear_model
model = linear_model.LogisticRegression()
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, r2_score
X=scaler.fit_transform(X)
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2)
model.fit(X_train,y_train)
# Make predictions using the testing set
powerOutput_y_pred = model.predict(X_test)
print (powerOutput_y_pred)
# The coefficients
print('Coefficients: \n', model.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(y_test, powerOutput_y_pred))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(y_test, powerOutput_y_pred))
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, powerOutput_y_pred, color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
But I am getting the following error for the scatter plot:
ValueError: x and y must be the same size
If I run df.head(), I get the following table:
The features X and y are as below:
X=df.values[:,[0,1,2,3,4,5,7]]
y=df.values[:,6]
Running X.shape gives (25, 7) and y.shape gives (25,) as output. How can I fix this shape mismatch?
Upvotes: 2
Views: 1679
Reputation: 13999
Just use plot instead of scatter:
plt.plot(X_test, y_test, ls="none", marker='.', ms=12)
This plots every column of the x data against the same single set of y data. It assumes that x.shape == (n,d) and y.shape == (n,), as in your question above.
Alternatively, if you want to keep using scatter, loop over the columns of your x values and call scatter once for each column:
colors = plt.cm.viridis(np.linspace(0.0, 1.0, X_test.shape[1]))
for xcol, c in zip(X_test.T, colors):
    # wrap the RGBA row in a list so it is read as one color, not as values to colormap
    plt.scatter(xcol, y_test, c=[c])
Setting c from the colors array plots each feature in a different color on the scatter plot. If you want them all to be black, just replace the colors logic above with c='black'.
scatter expects one list of x values and one list of y values. It's simplest if the x and y lists are 1D. However, you can also plot multiple sets of x and y data stored in 2D arrays, if those arrays have matching shapes.
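As a rough sketch of the matching-shape case, you could repeat y once per feature column so that it lines up with X. The names X_demo, y_demo and Y_demo and the random data are made up for illustration, and the Agg backend is only selected so the snippet runs without a display:

import numpy as np
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt

rng = np.random.RandomState(0)
n, d = 25, 7                      # same sizes as in the question
X_demo = rng.rand(n, d)           # made-up feature matrix, shape (25, 7)
y_demo = rng.rand(n)              # made-up target vector, shape (25,)

# Repeat y once per feature column so it has the same shape as X_demo.
Y_demo = np.repeat(y_demo[:, np.newaxis], d, axis=1)

fig, ax = plt.subplots()
points = ax.scatter(X_demo, Y_demo, color="black")

# scatter flattens both arrays, so there is one point per (sample, feature) pair
n_points = len(points.get_offsets())

With this approach every feature shares a single color, so the per-column loop is still the better fit if you want to tell the features apart.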
From the Matplotlib docs:
Fundamentally, scatter works with 1-D arrays; x, y, s, and c may be input as 2-D arrays, but within scatter they will be flattened.
A bit vague, but a dive into the Matplotlib source code confirms that the shapes of x and y have to match exactly. The code that handles shapes for plot is more flexible, so for that function you can get away with using one set of y data for many sets of x data.
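The difference can be checked with a small sketch along these lines. X_demo and y_demo are made-up random arrays with the same shapes as in the question, and the Agg backend is only an assumption so the snippet runs without a display:

import numpy as np
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt

rng = np.random.RandomState(0)
X_demo = rng.rand(25, 7)   # made-up features, shape (n, d)
y_demo = rng.rand(25)      # made-up targets, shape (n,)

fig, ax = plt.subplots()

# scatter compares 25 * 7 flattened x values against 25 y values and refuses
try:
    ax.scatter(X_demo, y_demo)
    scatter_error = None
except ValueError as exc:
    scatter_error = str(exc)

# plot pairs the single y vector with each column of X_demo
# and returns one Line2D object per column
lines = ax.plot(X_demo, y_demo, ls="none", marker=".", ms=12)
n_lines = len(lines)

print(scatter_error)
print(n_lines)

Here scatter_error should hold the same kind of message as in the question, while plot produces one marker-only line per feature column.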
Normally plot draws lines instead of dots, but you can turn lines off by setting ls (i.e. linestyle) to "none", and you can turn dots on by setting marker. ms (i.e. markersize) controls the size of the dots.
The example you posted above won't run (X and y aren't defined), but here's a complete example with output:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from sklearn import datasets
from sklearn.model_selection import train_test_split
d = datasets.load_diabetes()
features = d.data.shape[1]
X = d.data[:50,:]
Y = d.target[:50]
sample_weight = np.random.RandomState(442).rand(Y.shape[0])
# split train, test for calibration
X_train, X_test, Y_train, Y_test, sw_train, sw_test = \
    train_test_split(X, Y, sample_weight, test_size=0.9, random_state=442)
# use the plot function instead of scatter
# plot one set of y data against several sets of x data
plt.plot(X_test, Y_test, ls="none", marker='.', ms=12)
# call .scatter() multiple times in a loop
#colors = plt.cm.viridis(np.linspace(0.0, 1.0, features))
#for xcol,c in zip(X_test.T, colors):
# plt.scatter(xcol, Y_test, c=c)
output:
Upvotes: 2