Vijay
Vijay

Reputation: 320

How to use OneHotEncoder for multiple columns and automatically drop first dummy variable for each column?

This is the dataset with 3 cols and 3 rows

Name Organization Department

Manie   ABC2 FINANCE

Joyce   ABC1 HR

Ami   NSV2 HR

This is the code I have:

Now it is fine till here, how do i drop the first dummy variable column for each ?

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Data1.csv',encoding = "cp1252")
X = dataset.values


# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X_0 = LabelEncoder()
X[:, 0] = labelencoder_X_0.fit_transform(X[:, 0])
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])

onehotencoder = OneHotEncoder(categorical_features = "all")
X = onehotencoder.fit_transform(X).toarray()

Upvotes: 18

Views: 54350

Answers (7)

Hardik Kamboj
Hardik Kamboj

Reputation: 81

I use my own module for dealing with one hot encoding.

from sklearn.preprocessing import OneHotEncoder
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class My_encoder(BaseEstimator, TransformerMixin):
   
    def __init__(self,drop = 'first',sparse=False):
        self.encoder = OneHotEncoder(drop = drop,sparse = sparse)
        self.features_to_encode = []
        self.columns = []
    
    def fit(self,X_train,features_to_encode):
        
        data = X_train.copy()
        self.features_to_encode = features_to_encode
        data_to_encode = data[self.features_to_encode]
        self.columns = pd.get_dummies(data_to_encode,drop_first = True).columns
        self.encoder.fit(data_to_encode)
        return self.encoder
    
    def transform(self,X_test):
        
        data = X_test.copy()
        data.reset_index(drop = True,inplace =True)
        data_to_encode = data[self.features_to_encode]
        data_left = data.drop(self.features_to_encode,axis = 1)
        
        data_encoded = pd.DataFrame(self.encoder.transform(data_to_encode),columns = self.columns)
        
        return pd.concat([data_left,data_encoded],axis = 1)

Its pretty easy to use

features_to_encode = [---list of features to one hot encode--]
enc = My_encoder()
enc.fit(X_train,features_to_encode)
X_train = enc.transform(X_train)
X_test = enc.transform(X_test)

It returns dataframe with columns names. So, covers both the disadvantages of OneHotEncoder and pd.get_dummies(). So, we can use it to fit and transform, like OneHotEncoder, and also it saves us the column names and returns a datafram like dummies approach.

Upvotes: 1

Joseph Puthumana
Joseph Puthumana

Reputation: 69

Applying OneHotEncoder only to certain columns is possible with the ColumnTransformer. Create a separate pipeline for categorical and numerical variable and apply ColumnTransformer. More info about it can be found here ColumnTransformer.

Another great example of implementation of this is provided here.

Upvotes: 0

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for i in range(Y.shape[1]):
    Y[:,i] = le.fit_transform(Y[:,i])

Upvotes: -2

Jyoti Prasad Pal
Jyoti Prasad Pal

Reputation: 1629

It is quite simple in scikit-learn version starting from 0.21. One can use the drop parameter in OneHotEncoder and use it to drop one of the categories per feature. By default, it won't drop. Details can be found in documentation.

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder

//drops the first category in each feature
ohe = OneHotEncoder(drop='first', handle_unknown='error')

Upvotes: 3

MD Rijwan
MD Rijwan

Reputation: 481

I use my own template for doing that:

from sklearn.base import TransformerMixin
import pandas as pd
import numpy as np
class DataFrameEncoder(TransformerMixin):

    def __init__(self):
        """Encode the data.

        Columns of data type object are appended in the list. After 
        appending Each Column of type object are taken dummies and 
        successively removed and two Dataframes are concated again.

        """
    def fit(self, X, y=None):
        self.object_col = []
        for col in X.columns:
            if(X[col].dtype == np.dtype('O')):
                self.object_col.append(col)
        return self

    def transform(self, X, y=None):
        dummy_df = pd.get_dummies(X[self.object_col],drop_first=True)
        X = X.drop(X[self.object_col],axis=1)
        X = pd.concat([dummy_df,X],axis=1)
        return X

And for using this code just put this template in current directory with filename let's suppose CustomeEncoder.py and type in your code:

from customEncoder import DataFrameEncoder
data = DataFrameEncoder().fit_transormer(data)

And all the object type data removed, Encoded, removed first and joined together to give the final desired output.
PS: That the input file to this template is Pandas Dataframe.

Upvotes: 5

Robert
Robert

Reputation: 259

Encode the categorical variables one at a time. The dummy variables should go to the beginning index of your data set. Then, just cut off the first column like this:

X = X[:, 1:]

Then encode and repeat the next variable.

Upvotes: 0

Max Power
Max Power

Reputation: 8954

import pandas as pd
df = pd.DataFrame({'name': ['Manie', 'Joyce', 'Ami'],
                   'Org':  ['ABC2', 'ABC1', 'NSV2'],
                   'Dept': ['Finance', 'HR', 'HR']        
        })


df_2 = pd.get_dummies(df,drop_first=True)

test:

print(df_2)
   Dept_HR  Org_ABC2  Org_NSV2  name_Joyce  name_Manie
0        0         1         0           0           1
1        1         0         0           1           0
2        1         0         1           0           0 

UPDATE regarding your error with pd.get_dummies(X, columns =[1:]:

Per the documentation page, the columns parameter takes "Column Names". So the following code would work:

df_2 = pd.get_dummies(df, columns=['Org', 'Dept'], drop_first=True)

output:

    name  Org_ABC2  Org_NSV2  Dept_HR
0  Manie         1         0        0
1  Joyce         0         0        1
2    Ami         0         1        1

If you really want to define your columns positionally, you could do it this way:

column_names_for_onehot = df.columns[1:]
df_2 = pd.get_dummies(df, columns=column_names_for_onehot, drop_first=True)

Upvotes: 25

Related Questions