Reputation: 8008
I am trying to emulate the very simple example
import numpy as np
import matplotlib.pyplot as plt

N = 50
x = np.random.rand(N)
y = np.random.rand(N)
colors = np.random.rand(N)
area = np.pi * (15 * np.random.rand(N))**2  # 0 to 15 point radii
print(type(x),type(y))
print('training samples ',len(x),len(y))
plt.scatter(x, y, c=colors, alpha=0.5)
plt.show()
this shows
<class 'numpy.ndarray'> <class 'numpy.ndarray'>
training samples 50 50
as expected, and the plot shows up as well. Now I am trying to plot the results of a GradientBoostingRegressor as follows:
base_regressor = GradientBoostingRegressor()
base_regressor.fit(X_train, y_train)
y_pred_base = base_regressor.predict(X_test)
print(type(X_train),type(y_train))
print('training samples ',len(X_train),len(y_train))
print(type(X_test),type(y_pred_base))
print('base samples ',len(X_test),len(y_pred_base))
plt.figure()
plt.scatter(X_train, y_train, c="k", label="training samples")
plt.plot(X_test, y_pred_base, c="g", label="n_estimators=1", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Base Regression")
plt.legend()
plt.show()
Note that X_train, y_train, and X_test are all NumPy arrays. For the above code I get
<class 'numpy.ndarray'> <class 'numpy.ndarray'>
training samples 74067 74067
<class 'numpy.ndarray'> <class 'numpy.ndarray'>
base samples 166693 166693
but the plot does not show up and I get the error
ValueError: x and y must be the same size
at
plt.scatter(X_train, y_train, c="k", label="training samples")
but as seen in the output, x and y are of the same size and type. What am I doing wrong?
Upvotes: 1
Views: 161
Reputation: 1188
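Your X_train array is 2-dimensional, with 163 columns for each sample. len() only reports the number of rows, which is why the lengths you printed match; scatter compares the total number of elements, and those differ. Printing X_train.shape and y_train.shape makes this visible. You can't plot your y_train array, which is only 1-dimensional, against the entire X_train array. The same applies to plotting y_pred_base against X_test.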
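Here is a small sketch of that mismatch. The arrays are made-up stand-ins for your data: the 100 samples are an assumption for illustration, and only the 163 columns match your case.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 100 samples, 163 features per sample
X_train = rng.random((100, 163))
y_train = rng.random(100)

# len() only looks at the first axis, so these agree
print(len(X_train), len(y_train))      # 100 100

# shape and size reveal the extra dimension
print(X_train.shape, y_train.shape)    # (100, 163) (100,)
print(X_train.size, y_train.size)      # 16300 100
```

The last line is the comparison that fails: 16300 elements on the x side against 100 on the y side.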
You will have to choose one of the columns in the X arrays to plot against, editing your code something like this:
plt.scatter(X_train[:, 17], y_train, c="k", label="training samples")
plt.plot(X_test[:, 17], y_pred_base, c="g", label="n_estimators=1", linewidth=2)
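One caveat with the line plot: plt.plot connects points in array order, so if the chosen column is not sorted the line will zigzag. A minimal sketch of sorting first, using made-up data in place of your X_test and y_pred_base (the column index 17 is just the arbitrary one from above):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Made-up stand-ins for X_test and the model's predictions
X_test = rng.random((200, 163))
y_pred_base = 2.0 * X_test[:, 17] + 0.1 * rng.standard_normal(200)

col = 17
# Indices that put the chosen column in ascending order
order = np.argsort(X_test[:, col])

fig, ax = plt.subplots()
# Apply the same ordering to x and y so the pairs stay matched
ax.plot(X_test[order, col], y_pred_base[order], c="g", linewidth=2,
        label="prediction")
ax.set_xlabel(f"feature {col}")
ax.set_ylabel("target")
ax.legend()
```

Call plt.show() afterwards to display the figure. With real predictions the line may still look noisy, because the other 162 features also influence each point.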
Your independent variables (X) live in a 163-dimensional space. Each y value depends on the corresponding x value in each of these dimensions. A simple 2-dimensional scatter or line plot just can't display all of that information at once.
One thing you could do is find out which of the x variables your y values depend on most. You can access this with the base_regressor.feature_importances_ attribute. There's an example in the documentation here. Then you could make a plot against the most important ones. You could do this in multiple dimensions using a 3D scatter plot, or in even higher dimensions with something like corner.py.
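As a rough sketch of that idea, assuming synthetic data in which only column 3 actually drives the target (the shapes, the column, and random_state are all made up for illustration, not taken from your data):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Synthetic data: 300 samples, 10 features, target driven by column 3
X_train = rng.random((300, 10))
y_train = 5.0 * X_train[:, 3] + 0.1 * rng.standard_normal(300)

base_regressor = GradientBoostingRegressor(random_state=0)
base_regressor.fit(X_train, y_train)

# One importance score per feature column
importances = base_regressor.feature_importances_
top = int(np.argmax(importances))
print("most important column:", top)

# Plot the target against the single most important feature
fig, ax = plt.subplots()
ax.scatter(X_train[:, top], y_train, c="k", label="training samples")
ax.set_xlabel(f"feature {top}")
ax.set_ylabel("target")
ax.legend()
```

Call plt.show() to display it. On your real data, np.argsort(importances)[::-1] would give the columns ranked from most to least important if you want to plot more than one.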
Upvotes: 3