ARC

Reputation: 87

One Hot Encoding a single column

I am trying to use OneHotEncoder on the target column ('Species') of the Iris dataset.

But I am getting the following error:

ValueError: Expected 2D array, got 1D array instead:

Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm    Species
0   1   5.1 3.5 1.4         0.2     Iris-setosa
1   2   4.9 3.0 1.4         0.2     Iris-setosa
2   3   4.7 3.2 1.3         0.2     Iris-setosa
3   4   4.6 3.1 1.5         0.2     Iris-setosa
4   5   5.0 3.6 1.4         0.2     Iris-setosa

I did google the issue and I found that most scikit-learn estimators need a 2D array rather than a 1D array.

I also found a suggestion to pass the column's index within the dataframe in order to encode a single column, but that didn't work:

onehotencoder = OneHotEncoder(categorical_features=[df.columns.tolist().index('pattern_id')])
X = dataset.iloc[:,1:5].values
y = dataset.iloc[:, 5].values

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

labelencoder= LabelEncoder()
y = labelencoder.fit_transform(y)


onehotencoder = OneHotEncoder(categorical_features=[0])
y = onehotencoder.fit_transform(y)

I am trying to encode a single categorical column and split it into multiple columns (the way one-hot encoding usually works).

Upvotes: 4

Views: 25090

Answers (3)

Abhijith

Reputation: 124

I came across a similar situation and found the method below to work:

Use double square brackets around the column name in the fit or fit_transform call, so that a 2D DataFrame is passed instead of a 1D Series.

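import pandas as pd
from sklearn.preprocessing import OneHotEncoder

one_hot_enc = OneHotEncoder()

# fit_transform returns a sparse matrix by default, so convert it to a dense array
arr = one_hot_enc.fit_transform(data[['column']]).toarray()
df = pd.DataFrame(arr)

fit_transform returns a sparse matrix by default; once it is converted with toarray(), you can turn it into a pandas DataFrame. You can then append this to the original dataframe or assign its columns directly to existing ones.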
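Appending the encoded columns back to the original dataframe could look roughly like the sketch below. The toy dataframe and its column names are assumptions for illustration only; the column labels are taken from the encoder's categories_ attribute.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data standing in for the real dataframe (assumed for illustration).
data = pd.DataFrame({
    "SepalLengthCm": [5.1, 7.0, 6.3],
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

encoder = OneHotEncoder()
# Double brackets select a one-column DataFrame (2D), not a Series (1D).
encoded = encoder.fit_transform(data[["Species"]]).toarray()

# categories_ holds the learned labels, one array per input column.
encoded_df = pd.DataFrame(encoded, columns=encoder.categories_[0], index=data.index)

# Replace the original column with its one-hot columns.
result = pd.concat([data.drop(columns="Species"), encoded_df], axis=1)
print(result)
```

Passing index=data.index keeps the rows aligned when the original dataframe does not use a default 0..n-1 index.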

Upvotes: 6

Rorschach

Reputation: 32426

For your case, since it looks like you are using the Kaggle dataset, I would just use pd.get_dummies:

import pandas as pd
pd.get_dummies(df.Species).head()

Out[158]: 
   Iris-setosa  Iris-versicolor  Iris-virginica
0            1                0               0
1            1                0               0
2            1                0               0
3            1                0               0
4            1                0               0

Note that the default here encodes all the classes (3 species), whereas it is common to encode only two of them and compare differences in the means against the remaining baseline group (e.g. the default in R, or typically when doing regression/ANOVA). That can be accomplished using the drop_first argument.
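As a small sketch of the difference, assuming a made-up Series of species labels rather than the real dataset:

```python
import pandas as pd

# A few made-up labels standing in for df.Species (assumed for illustration).
species = pd.Series(
    ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"],
    name="Species",
)

full = pd.get_dummies(species)                      # one column per class
reduced = pd.get_dummies(species, drop_first=True)  # first class becomes the baseline

print(list(full.columns))
print(list(reduced.columns))
```

With drop_first=True the first class in sorted order (here Iris-setosa) gets no column of its own; its rows are the ones where every remaining column is zero.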

Upvotes: 5

Arkady. A

Reputation: 545

ValueError: Expected 2D array, got 1D array instead: Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

This says that you need to convert your 1D array into a 2D array with a single column. You can do that by:

>>> from sklearn import datasets
>>> from sklearn.preprocessing import OneHotEncoder
>>> import pandas as pd
>>> import numpy as np
>>> # load the iris dataset
>>> iris = datasets.load_iris()
>>> iris = pd.DataFrame(data=np.c_[iris['data'], iris['target']], columns=iris['feature_names'] + ['target'])
>>> y = iris.target.values
>>> onehotencoder = OneHotEncoder(categories='auto')
>>> # reshape(-1, 1) turns the 1D target into a 2D array with one column
>>> y = onehotencoder.fit_transform(y.reshape(-1, 1))
>>> # y is now a sparse matrix with dtype numpy.float64;
>>> # call toarray() if you want a dense array
>>> print(y.toarray())
[[1. 0. 0.]
 [1. 0. 0.]
    . . . . 
 [0. 0. 1.]
 [0. 0. 1.]]
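To see what the two reshape calls from the error message actually do, here is a minimal sketch on a made-up label array (the values are arbitrary):

```python
import numpy as np

y = np.array([0, 1, 2, 1])   # 1D array, shape (4,)
column = y.reshape(-1, 1)    # 2D: 4 samples, 1 feature
row = y.reshape(1, -1)       # 2D: 1 sample, 4 features

print(y.shape, column.shape, row.shape)
```

For a target column, the single-feature form reshape(-1, 1) is the one you want, since each row is one sample.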

You can also use the pandas get_dummies function:

>>> pd.get_dummies(iris.target).head()
   0.0  1.0  2.0
0    1    0    0
1    1    0    0
2    1    0    0
3    1    0    0
4    1    0    0

Hope that helps!

Upvotes: 9
