Klausos Klausos

Reputation: 16050

A column-vector y was passed when a 1d array was expected

I need to fit RandomForestRegressor from sklearn.ensemble.

forest = ensemble.RandomForestRegressor(**RF_tuned_parameters)
model = forest.fit(train_fold, train_y)
yhat = model.predict(test_fold)

This code worked until I added some preprocessing of the data (train_y). The error message now says:

DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().

model = forest.fit(train_fold, train_y)

Previously train_y was a Series; now it is a NumPy array (a column vector). If I apply train_y.ravel(), it becomes a flat 1-D array and no error message appears, though the prediction step takes a very long time (actually, it never finishes...).

In the docs of RandomForestRegressor I found that train_y should be defined as y : array-like, shape = [n_samples] or [n_samples, n_outputs]. Any idea how to solve this issue?

Upvotes: 249

Views: 362300

Answers (10)

logn

Reputation: 150

sklearn works with pandas DataFrames.

For example, when given pandas inputs, the sklearn function train_test_split returns splits where x is a DataFrame and y is a pandas Series.

If that changed because of fitting or anything else, you end up with something like your array.

I assume your array could look like [[1,2,3,4,5],["column_name"]] or just a plain array [1,2,3,4,5].

Convert it back with y = pd.Series(array) or y = pd.Series(array[0]),

which is what was originally expected as input.

Upvotes: 0

Marcel H.

Reputation: 307

TL;DR
use

y = np.squeeze(y)

instead of

y = y.ravel()

Although NumPy's ravel() is a valid way to achieve the desired result in this particular case, I would recommend using numpy.squeeze() instead.
The problem is that if the shape of your y (a NumPy array) is e.g. (100, 2), then y.ravel() flattens both columns into a single 1-D array of shape (200,), with the values interleaved row by row. This might not be what you want when dealing with independent variables that have to be regarded on their own.
On the other hand, numpy.squeeze() only removes redundant dimensions (i.e. those of size 1). So, if your array's shape is (100, 1), the result has shape (100,), whereas an array of shape (100, 2) is left unchanged, as none of its dimensions has size 1.
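The difference between the two calls can be checked on small arrays. The following is a minimal sketch; the array names and sizes are made up for illustration.

```python
import numpy as np

# Hypothetical targets: a single-output column vector and a two-output array.
y_single = np.arange(4).reshape(4, 1)   # shape (4, 1)
y_multi = np.arange(8).reshape(4, 2)    # shape (4, 2)

# Both calls turn the column vector into a 1-D array.
single_ravel = y_single.ravel()
single_squeeze = np.squeeze(y_single)

# On the two-output array, ravel() flattens everything,
# while squeeze() leaves the shape untouched.
multi_ravel = y_multi.ravel()
multi_squeeze = np.squeeze(y_multi)

print(single_ravel.shape, single_squeeze.shape)
print(multi_ravel.shape, multi_squeeze.shape)
```

If y could ever contain a single sample, i.e. shape (1, 1), np.squeeze(y, axis=1) removes only the second axis and keeps the sample axis.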

Upvotes: 2

Linda MacPhee-Cobb

Reputation: 7856

Change this line:

model = forest.fit(train_fold, train_y)

to:

model = forest.fit(train_fold, train_y.values.ravel())

Explanation:

.values will give the values in a numpy array (shape: (n,1))

.ravel will convert that array shape to (n, ) (i.e. flatten it)
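The shapes described above can be verified with a small sketch. It assumes train_y is a single-column pandas DataFrame; the column name and values are hypothetical.

```python
import pandas as pd

# Hypothetical single-column DataFrame standing in for train_y.
train_y = pd.DataFrame({"target": [0.5, 1.5, 2.5]})

column = train_y.values   # 2-D NumPy array, shape (3, 1)
flat = column.ravel()     # 1-D NumPy array, shape (3,)

print(column.shape, flat.shape)
```

If train_y is already a NumPy array, as in the question, it has no .values attribute, so train_y.ravel() is called on it directly.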

Upvotes: 402

Jeyakeethan Geethan

Reputation: 79

Y = y.values[:,0]

Here Y is the formatted train_y and y is the original train_y.

Upvotes: 4

Bibby Wang

Reputation: 11

format_train_y=[]
for n in train_y:
    format_train_y.append(n[0])

Upvotes: 1

AlexB

Reputation: 3548

With Neuraxle, you can easily solve this:

p = Pipeline([
   # expected outputs shape: (n, 1)
   OutputTransformerWrapper(NumpyRavel()), 
   # expected outputs shape: (n, )
   RandomForestRegressor(**RF_tuned_parameters)
])

p, outputs = p.fit_transform(data_inputs, expected_outputs)

Neuraxle is a sklearn-like framework for hyperparameter tuning and AutoML in deep learning projects.

Upvotes: 2

I had the same problem. The labels were a column vector of shape (n, 1), while a 1-D array of shape (n,) was expected. Use np.ravel():

knn.score(training_set, np.ravel(training_labels))

Hope this solves it.

Upvotes: 22

sushmit

Reputation: 4603

Another way of doing this is to use reshape instead of ravel:

model = forest.fit(train_fold, train_y.values.reshape(-1,))
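For a column vector, reshape(-1,) and ravel() give the same 1-D result. A minimal sketch with a made-up array:

```python
import numpy as np

# Hypothetical column vector, shape (3, 1).
column = np.array([[1.0], [2.0], [3.0]])

via_reshape = column.reshape(-1,)   # -1 lets NumPy infer the length
via_ravel = column.ravel()

print(via_reshape.shape, via_ravel.shape)
print(np.array_equal(via_reshape, via_ravel))
```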

Upvotes: 3

Simon Leung

Reputation: 281

I also ran into this when training a KNN classifier. The warning went away after I changed:
knn.fit(X_train,y_train)
to
knn.fit(X_train, np.ravel(y_train,order='C'))

This requires import numpy as np earlier in the script.

Upvotes: 28

Soumyaansh

Reputation: 8988

Use the code below:

model = forest.fit(train_fold, train_y.ravel())

If you then get an error like the one below:

"Unknown label type: %r" % y

use this code:

y = train_y.ravel()
train_y = np.array(y).astype(int)
model = forest.fit(train_fold, train_y)
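Note that astype(int) drops the fractional part of each value, which matters when the targets are continuous, as they usually are for a regressor. A minimal sketch with made-up numbers (train_y here is a hypothetical stand-in for the real targets):

```python
import numpy as np

# Hypothetical continuous targets as a column vector, shape (3, 1).
train_y = np.array([[0.9], [1.2], [2.7]])

y = train_y.ravel()                # shape (3,), values unchanged
as_int = np.array(y).astype(int)   # fractional parts are discarded

print(y)
print(as_int)
```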

Upvotes: 14
