Makaroniiii

Reputation: 348

Removing columns with sklearn's OneHotEncoder

from sklearn.preprocessing import OneHotEncoder as OHE
import numpy as np

a = np.array([[0,1,100],[1,2,200],[2,3,400]])


oh = OHE(categorical_features=[0,1])
a = oh.fit_transform(a).toarray()

Let's assume the first and second columns are categorical data. This code does the one-hot encoding, but for a regression problem I would like to remove the first dummy column of each encoded feature, to avoid the dummy variable trap. In this example there are only two categorical features and I could do it manually. But if you have many categorical features, how would you solve this problem?

Upvotes: 3

Views: 8797

Answers (4)

Marcus V.

Reputation: 6859

For that I use a wrapper like the one below, which is also usable in pipelines:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder

class DummyEncoder(BaseEstimator, TransformerMixin):

    def __init__(self, n_values='auto'):
        self.n_values = n_values

    def fit(self, X, y=None, **fit_params):
        # Fit the encoder once, so train and test data get consistent columns
        self.ohe_ = OneHotEncoder(sparse=False, n_values=self.n_values)
        self.ohe_.fit(X)
        return self

    def transform(self, X):
        # Drop the last dummy column, so one level serves as the base level
        return self.ohe_.transform(X)[:, :-1]
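A minimal usage sketch (the names X_categorical and y are illustrative, assuming X_categorical holds only the label-encoded categorical columns and y is the target):

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

# Illustrative pipeline: one-hot encode (dropping the base level), then regress
pipe = Pipeline([('dummies', DummyEncoder()),
                 ('regress', LinearRegression())])
pipe.fit(X_categorical, y)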

Upvotes: 5

Mesfin D

Reputation: 1

This is one of the limitations of the one-hot encoder in sklearn when building models. If you have multiple categorical variables, the best way is to first use LabelEncoder to identify the unique labels of each categorical variable, and then use them to generate the indexes to delete. As an example, if you have the data in a numpy array X, with categorical variables in columns FIRST_IDX, SECOND_IDX and THIRD_IDX, first encode them using LabelEncoder.

labelencoder_X_1 = LabelEncoder()
X[:, FIRST_IDX] = labelencoder_X_1.fit_transform(X[:, FIRST_IDX])

labelencoder_X_2 = LabelEncoder()
X[:, SECOND_IDX] = labelencoder_X_2.fit_transform(X[:, SECOND_IDX])

labelencoder_X_3 = LabelEncoder()
X[:, THIRD_IDX] = labelencoder_X_3.fit_transform(X[:, THIRD_IDX])

Then apply the one-hot encoder, which places the encoded columns for all categorical variables at the beginning of the array, one variable after the other.

onehotencoder = OneHotEncoder(categorical_features=[FIRST_IDX, SECOND_IDX, THIRD_IDX])

X = onehotencoder.fit_transform(X).toarray()

Finally, eliminate the first dummy column of each categorical variable, using the number of unique values of each variable and their cumulative sum in numpy (the cumulative sum gives the index of the first dummy column of each variable).

index_to_delete = np.cumsum([0,
               len(labelencoder_X_1.classes_),
               len(labelencoder_X_2.classes_),
               len(labelencoder_X_3.classes_)
               ])[:-1]  # drop the last partial sum: it points past the dummy columns
index_to_keep = [i for i in range(X.shape[1]) if i not in index_to_delete]

X = X[:, index_to_keep]

Now X contains the data ready to be used in any modeling task.
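For instance, here is a minimal sketch of the same recipe on the array from the question (only two categorical variables, so only two label encoders), using the same old-style OneHotEncoder API as above:

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

X = np.array([[0, 1, 100], [1, 2, 200], [2, 3, 400]])

labelencoder_X_1 = LabelEncoder()
X[:, 0] = labelencoder_X_1.fit_transform(X[:, 0])
labelencoder_X_2 = LabelEncoder()
X[:, 1] = labelencoder_X_2.fit_transform(X[:, 1])

X = OneHotEncoder(categorical_features=[0, 1]).fit_transform(X).toarray()

# cumsum([0, 3]) -> [0, 3], the first dummy column of each variable
index_to_delete = np.cumsum([0, len(labelencoder_X_1.classes_)])
index_to_keep = [i for i in range(X.shape[1]) if i not in index_to_delete]
X = X[:, index_to_keep]  # shape (3, 5): two dummies per variable plus the numeric column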

Upvotes: 0

Tonnam Balankura

Reputation: 123

To do this automatically, we get the list of indices to drop before applying one-hot encoding by identifying the most common level of each categorical feature. The most common level serves best as the base level, because it allows the importance of the other levels to be evaluated against it.

After applying one-hot encoding, we get the list of indices to keep and use it to drop the previously determined columns.

from sklearn.preprocessing import OneHotEncoder as OHE
import numpy as np
import pandas as pd

a = np.array([[0,1,100],[1,2,200],[2,3,400]])

def get_indices_to_drop(X_before_OH, categorical_indices_list):
    # Returns the list of column indices to drop after one-hot encoding
    # Dropping the most common level within each categorical variable
    # This is because the most common level serves best as the base level,
    # allowing the importance of the other levels to be evaluated
    indices_to_drop = []
    indices_accum = 0
    for i in categorical_indices_list:
        levels = np.unique(X_before_OH[:, i])
        most_common = pd.Series(X_before_OH[:, i]).value_counts().index[0]
        # A level's dummy column is its rank among the sorted levels,
        # offset by the width of all preceding dummy blocks
        rank = int(np.searchsorted(levels, most_common))
        indices_to_drop.append(rank + indices_accum)
        indices_accum += len(levels)
    return indices_to_drop

indices_to_drop = get_indices_to_drop(a, [0, 1])

oh = OHE(categorical_features=[0,1])
a = oh.fit_transform(a).toarray()

def get_indices_to_keep(X_after_OH, index_to_drop_list):
    return [i for i in range(X_after_OH.shape[-1]) if i not in index_to_drop_list]

indices_to_keep = get_indices_to_keep(a, indices_to_drop)
a = a[:, indices_to_keep]
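One caveat: in the toy array from the question every level occurs exactly once, so value_counts() ties and the base level it picks is arbitrary; on real data with repeated levels the choice is well defined. A quick check of the result:

print(indices_to_drop)  # e.g. [0, 3], one base-level column per variable
print(a.shape)          # (3, 5) after dropping the two base columns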

Upvotes: 1

cs95

Reputation: 402363

You can use numpy slicing and cut off the first column:

>>> a
array([[   1.,    0.,    0.,    1.,    0.,    0.,  100.],
       [   0.,    1.,    0.,    0.,    1.,    0.,  200.],
       [   0.,    0.,    1.,    0.,    0.,    1.,  400.]])
>>> a[:, 1:]
array([[   0.,    0.,    1.,    0.,    0.,  100.],
       [   1.,    0.,    0.,    1.,    0.,  200.],
       [   0.,    1.,    0.,    0.,    1.,  400.]])

If you have a list of columns you want to delete, here's how you'd do that with fancy indexing:

>>> idx_to_delete = [0, 3]
>>> indices = [i for i in range(a.shape[-1]) if i not in idx_to_delete]
>>> indices
[1, 2, 4, 5, 6]
>>> a[:, indices]
array([[   0.,    0.,    0.,    0.,  100.],
       [   1.,    0.,    1.,    0.,  200.],
       [   0.,    1.,    0.,    1.,  400.]])
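For what it's worth, on newer scikit-learn versions (0.21 and later) OneHotEncoder can do this dropping itself via drop='first'. The categorical_features argument was later removed, so in this sketch the categorical columns are encoded separately and stacked back with numpy:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

a = np.array([[0, 1, 100], [1, 2, 200], [2, 3, 400]])

# drop='first' removes the first dummy column of every encoded feature
enc = OneHotEncoder(drop='first')
dummies = enc.fit_transform(a[:, :2]).toarray()
a = np.hstack([dummies, a[:, 2:]])  # shape (3, 5)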

Upvotes: 1
