AbtPst

Reputation: 8008

Pyplot cannot plot Regression

I am trying to emulate the very simple example

import numpy as np
import matplotlib.pyplot as plt

N = 50
x = np.random.rand(N)
y = np.random.rand(N)
colors = np.random.rand(N)
area = np.pi * (15 * np.random.rand(N))**2  # 0 to 15 point radii

print(type(x),type(y))
print('training samples ',len(x),len(y))
plt.scatter(x, y, c=colors, alpha=0.5)
plt.show()

this shows

<class 'numpy.ndarray'> <class 'numpy.ndarray'>
training samples  50 50

as expected and the plot shows up as well. Now I am trying to plot the results of GradientBoostingRegressor as

from sklearn.ensemble import GradientBoostingRegressor

base_regressor = GradientBoostingRegressor()
base_regressor.fit(X_train, y_train)
y_pred_base = base_regressor.predict(X_test)

print(type(X_train),type(y_train))
print('training samples ',len(X_train),len(y_train))
print(type(X_test),type(y_pred_base))
print('base samples ',len(X_test),len(y_pred_base))

plt.figure()

plt.scatter(X_train, y_train, c="k", label="training samples")
plt.plot(X_test, y_pred_base, c="g", label="n_estimators=1", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Base Regression")
plt.legend()
plt.show()

Note that X_train, y_train, and X_test are all numpy arrays. For the above code I get

<class 'numpy.ndarray'> <class 'numpy.ndarray'>
training samples  74067 74067
<class 'numpy.ndarray'> <class 'numpy.ndarray'>
base samples  166693 166693

but the plot does not show up and I get the error

ValueError: x and y must be the same size

at

plt.scatter(X_train, y_train, c="k", label="training samples")

but as seen in the output, x and y are of the same size and type. What am I doing wrong?

Upvotes: 1

Views: 161

Answers (1)

kiliantics

Reputation: 1188

Your X_train array is 2-dimensional, with 163 feature columns for each sample. You can't plot your y_train array, which is only 1-dimensional, against the entire X_train array; the same goes for plotting y_pred_base against X_test.
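A quick way to see this (a small check added here, not part of the original code) is to print each array's .shape instead of len(), since len() on a 2-D array only counts its rows:

print(X_train.shape)  # e.g. (74067, 163)  -> 2-D: samples x feature columns
print(y_train.shape)  # (74067,)           -> 1-D
print(X_test.shape)   # e.g. (166693, 163)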

You will have to choose one of the columns in the X arrays to plot against, editing your code to something like this:

plt.scatter(X_train[:, 17], y_train, c="k", label="training samples")
plt.plot(X_test[:, 17], y_pred_base, c="g", label="n_estimators=1", linewidth=2)

Your independent variables (the features in X) live in a 163-dimensional space, and each y value depends on the corresponding x value from every one of those dimensions. A simple 2-dimensional scatter or line plot just can't display all of that information at once.

One thing you could do is find out which of the x variables your y values depend on most. You can access this through the base_regressor.feature_importances_ attribute (there is an example in the scikit-learn documentation). Then you could plot against the most important ones, either in multiple dimensions using a 3D scatter plot or in even higher dimensions with something like corner.py.
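As a rough sketch of that idea (assuming base_regressor has already been fitted and X_train, y_train, X_test, and y_pred_base exist as in the question), you could pick out the single most important feature and plot against that column:

import numpy as np
import matplotlib.pyplot as plt

# Column index of the most important feature according to the fitted model
best = np.argmax(base_regressor.feature_importances_)

plt.figure()
plt.scatter(X_train[:, best], y_train, c="k", alpha=0.3, label="training samples")
plt.scatter(X_test[:, best], y_pred_base, c="g", alpha=0.3, label="predictions")
plt.xlabel("most important feature (column %d)" % best)
plt.ylabel("target")
plt.title("Base Regression")
plt.legend()
plt.show()

Using scatter for the predictions here avoids the tangled line you would get from plt.plot when X_test is not sorted along the chosen column.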

Upvotes: 3
