Amira Elsayed Ismail

Reputation: 9414

How to one-hot encode a column in a pandas DataFrame in Python

I have a dataset that includes a categorical column for education level. The initial values were 0, NaN, high school, graduate school, and university. I have cleaned the data and converted it to the following values:

0 -> others
1 -> high school
2 -> graduate school
3 -> university

These codes are stored in the same column. Now I want to one-hot encode this column into 4 columns.

I tried scikit-learn as follows:

onehot_encoder = OneHotEncoder()
onehot_encoded = onehot_encoder.fit_transform(df_csv['EDUCATION'])
print(onehot_encoded)

and I got this error:

ValueError: Expected 2D array, got 1D array instead:
array=[3 3 3 ... 3 1 3].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

Upvotes: 0

Views: 687

Answers (2)

user6386471

Reputation: 1263

For your specific case, reshaping the underlying array to 2D (and setting sparse=False, which is named sparse_output in scikit-learn 1.2 and later) gives you the one-hot encoded result as a dense NumPy array:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'EDUCATION':['high school','high school','high school',
                                'university','university','university',
                                'graduate school', 'graduate school','graduate school',
                                'others','others','others']})

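# sparse=False returns a dense array; on scikit-learn >= 1.2 use sparse_output=False
onehot_encoder = OneHotEncoder(sparse=False)
# reshape(-1, 1) turns the 1D column into the 2D (n_samples, 1) array the encoder expects
onehot_encoder.fit_transform(df['EDUCATION'].to_numpy().reshape(-1,1))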

>>>

array([[0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.]])
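
If your column already holds the integer codes 0 to 3, as in the question, something along these lines may be closer to what you need. This is only a sketch: the sample values in df_csv are made up, the explicit categories list assumes the four codes from the question, and the signature check is just one way to cope with the sparse / sparse_output rename across scikit-learn versions.

```python
import inspect

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample data using the integer codes from the question
df_csv = pd.DataFrame({"EDUCATION": [3, 3, 1, 2, 0, 3]})

# `sparse` was renamed to `sparse_output` in scikit-learn 1.2,
# so pick whichever keyword the installed version accepts
params = inspect.signature(OneHotEncoder).parameters
dense_kw = {"sparse_output": False} if "sparse_output" in params else {"sparse": False}

# Listing the categories fixes the column order to 0, 1, 2, 3
# and keeps a column even for a code that is absent from the data
encoder = OneHotEncoder(categories=[[0, 1, 2, 3]], **dense_kw)

# Double brackets select a one-column DataFrame, which is already 2D,
# so no reshape is needed
onehot = encoder.fit_transform(df_csv[["EDUCATION"]])
```

Selecting with df_csv[["EDUCATION"]] instead of df_csv["EDUCATION"] is what avoids the "Expected 2D array" error here.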

The most straightforward approach in my opinion is using pandas.get_dummies:

pd.get_dummies(df['EDUCATION'])
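
For the integer-coded column from the question, a rough sketch of the same idea could look like the following. The sample data, the labels dictionary, and the EDUCATION prefix are my assumptions for illustration; the labels follow the mapping given in the question.

```python
import pandas as pd

# Hypothetical sample data using the integer codes from the question
df_csv = pd.DataFrame({"EDUCATION": [3, 3, 1, 2, 0, 3]})

# Mapping taken from the question: code -> readable label
labels = {0: "others", 1: "high school", 2: "graduate school", 3: "university"}

# A categorical dtype with all four categories guarantees four output
# columns, even if one level does not occur in the data
education = df_csv["EDUCATION"].map(labels).astype(
    pd.CategoricalDtype(categories=list(labels.values()))
)

# One 0/1 column per category, named EDUCATION_<label>
dummies = pd.get_dummies(education, prefix="EDUCATION", dtype=int)

# Attach the new columns to the original frame
df_encoded = df_csv.join(dummies)
```

You can drop the original column afterwards with df_encoded.drop(columns="EDUCATION") if you no longer need the codes.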


Upvotes: 1

meTchaikovsky

Reputation: 7676

You need to pass a 2D array (note the [:,None] below) and set sparse to False to get a dense result:

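import numpy as np
from sklearn.preprocessing import OneHotEncoder

# sparse=False returns a dense array (sparse_output=False on scikit-learn >= 1.2)
onehot_encoder = OneHotEncoder(sparse=False)
# [:,None] adds a second axis, giving the 2D shape (100, 1)
y_train = np.random.randint(0,4,100)[:,None]
y_train = onehot_encoder.fit_transform(y_train)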

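Or you can do something like this with Keras:

import numpy as np
from sklearn.preprocessing import LabelEncoder
# older Keras releases exposed this as keras.utils.np_utils.to_categorical
from keras.utils import to_categorical

y_train = np.random.randint(0,4,100)
# map the labels to integers 0..n_classes-1, then expand them to one-hot rows
encoder = LabelEncoder()
encoded_y = encoder.fit_transform(y_train)
y_train = to_categorical(encoded_y)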
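
If you would rather not pull in Keras just for this, a NumPy-only sketch of the same idea is to index an identity matrix with the integer codes. The sample codes and num_classes = 4 are assumptions matching the four education levels in the question; this only works when the codes are already integers in the range 0 to num_classes - 1.

```python
import numpy as np

# Hypothetical integer codes, as produced by the cleaning step in the question
codes = np.array([3, 3, 1, 2, 0, 3])
num_classes = 4  # assumed: others, high school, graduate school, university

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing with the codes picks one such row per sample
onehot = np.eye(num_classes, dtype=int)[codes]
```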

Upvotes: 1
