Reputation: 351
I have a pandas DataFrame with some categorical columns. Some of these contain non-integer values.
I currently want to apply several machine learning models to this data. For some models it is necessary to preprocess the data to get better results, for example by converting categorical variables into dummy/indicator variables. Pandas has a function called get_dummies for that purpose. However, its output depends on the data it is given: if I call get_dummies on the training data and then again on the test data, the two results can have different columns, because a categorical column in the test data may contain only a subset (or a different set) of the values that appear in the training data.
Therefore, I am looking for other methods to do one-hot encoding.
What are possible ways to do one-hot encoding in Python (pandas/sklearn)?
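To make the mismatch concrete, here is a minimal illustration (the column name and values are made up for the example):
import pandas as pd

train = pd.DataFrame({"color": ["red", "green", "blue"]})
test = pd.DataFrame({"color": ["red", "red"]})  # only a subset of the training values

print(pd.get_dummies(train).columns.tolist())
# ['color_blue', 'color_green', 'color_red']
print(pd.get_dummies(test).columns.tolist())
# ['color_red']  <- fewer columns, so a model fit on train cannot consume this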
Upvotes: 6
Views: 8275
Reputation: 11
Sklearn actually has a OneHotEncoder that works pretty well.
from sklearn.preprocessing import OneHotEncoder

# sparse=False returns a dense array (in scikit-learn >= 1.2 the keyword is sparse_output)
one_hot_encoder = OneHotEncoder(sparse=False)

# fit on the training labels, then reuse the fitted encoder for the other splits
train_labels_one_hot = one_hot_encoder.fit_transform(train_df["target"].to_numpy().reshape(-1, 1))
val_labels_one_hot = one_hot_encoder.transform(val_df["target"].to_numpy().reshape(-1, 1))
test_labels_one_hot = one_hot_encoder.transform(test_df["target"].to_numpy().reshape(-1, 1))
Using just the transform function after first fitting with fit_transform, as above, keeps the encoded columns matching correctly across the splits.
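One caveat worth noting: by default, transform raises an error if the new data contains a category that was not seen during fitting. A minimal sketch of the handle_unknown="ignore" option, which maps unseen values to all-zero rows instead (assumes scikit-learn >= 1.2 for the sparse_output keyword):
import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
enc.fit(np.array([["a"], ["b"], ["c"]]))
print(enc.transform(np.array([["b"], ["d"]])))
# [[0. 1. 0.]
#  [0. 0. 0.]]  <- "d" was never seen during fit, so it becomes all zeros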
Upvotes: 0
Reputation: 2611
For text columns, you can try this:
from sklearn.feature_extraction.text import CountVectorizer
data = ['he is good','he is bad','he is strong']
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(data)
To print the output:
for i in range(len(data)):
    print(vectors[i, :].toarray())
Output:
[[0 1 1 1 0]]
[[1 0 1 1 0]]
[[0 0 1 1 1]]
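The columns follow the alphabetically sorted vocabulary, which you can inspect (get_feature_names_out is the method name in recent scikit-learn versions):
print(vectorizer.get_feature_names_out())
# ['bad' 'good' 'he' 'is' 'strong']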
Upvotes: 0
Reputation: 2553
In the past, I've found the easiest way to deal with this problem is to use get_dummies and then enforce that the columns match up between test and train. For example, you might do something like:
import numpy as np
import pandas as pd

train = pd.get_dummies(train_df)
test = pd.get_dummies(test_df)
# get the columns in train that are not in test
col_to_add = np.setdiff1d(train.columns, test.columns)
# add these columns to test, setting them equal to zero
for c in col_to_add:
    test[c] = 0
# select and reorder the test columns using the train columns
test = test[train.columns]
This will discard information about labels that you haven't seen in the training set, but will enforce consistency. If you're doing cross-validation using these splits, I'd recommend two things. First, do get_dummies on the whole dataset to get all of the columns (instead of just on the training set as in the code above). Second, use StratifiedKFold for cross-validation so that your splits contain the relevant labels; a sketch of that setup follows.
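A minimal sketch of that cross-validation setup, assuming a DataFrame df whose label column is named "label" (both names are illustrative):
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# encode on the full dataset so every fold sees the same columns
X = pd.get_dummies(df.drop(columns="label"))
y = df["label"]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    # ... fit and evaluate a model on this fold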
Upvotes: 7
Reputation: 4605
Say I have a feature "A" with possible values "a", "b", "c", "d", but the training data set contains only the three values "a", "b", "c". If get_dummies is used at this stage, only three features are generated (A_a, A_b, A_c), whereas ideally there should be a fourth feature, A_d, filled with zeros. That can be achieved in the following way:
import pandas as pd

data = pd.DataFrame({"A": ["a", "b", "c"]})
# declare the full category set up front, including values absent from the data
# (astype("category", categories=...) was removed in modern pandas; use CategoricalDtype)
data["A"] = data["A"].astype(pd.CategoricalDtype(categories=["a", "b", "c", "d"]))
mod_data = pd.get_dummies(data[["A"]])
print(mod_data)
The output being
A_a A_b A_c A_d
0 1.0 0.0 0.0 0.0
1 0.0 1.0 0.0 0.0
2 0.0 0.0 1.0 0.0
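Applying the same dtype to the test data then yields the same four dummy columns, regardless of which values actually appear (a sketch with made-up test values):
# reuse the same category set for the test data
cat_dtype = pd.CategoricalDtype(categories=["a", "b", "c", "d"])
test = pd.DataFrame({"A": ["a", "d"]}).astype(cat_dtype)
print(pd.get_dummies(test[["A"]]).columns.tolist())
# ['A_a', 'A_b', 'A_c', 'A_d']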
Upvotes: 0
Reputation: 8270
Scikit-learn provides an encoder, sklearn.preprocessing.LabelBinarizer.
For encoding the training data you can use fit_transform, which will discover the category labels and create the appropriate dummy variables.
import sklearn.preprocessing

label_binarizer = sklearn.preprocessing.LabelBinarizer()
training_mat = label_binarizer.fit_transform(df.Label)
For the test data, you can apply the same set of categories using transform.
test_mat = label_binarizer.transform(test_df.Label)
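A small self-contained check of the behaviour (the labels are made up for the demo):
import sklearn.preprocessing

label_binarizer = sklearn.preprocessing.LabelBinarizer()
print(label_binarizer.fit_transform(["a", "b", "c", "a"]))
# [[1 0 0]
#  [0 1 0]
#  [0 0 1]
#  [1 0 0]]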
Upvotes: 9