CatsRule

Reputation: 21

SKLearn: Dummy Variables for Label Encoded Categorical Values

I begin by setting my X from the Excel dataset and converting it into a matrix of values:

X = dataset.iloc[:, 3:13].values

I have two columns in X that I need to label encode (country and gender). There are three countries (Spain, France, and Germany) and only two genders. I label encode them:

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1]) # the three countries
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])

Okay, now I need to create dummy variables for the three countries, since they have no hierarchical relationship in which one value ranks higher than another. However, the following code doesn't work:

onehotencoder = OneHotEncoder(categorical_features = [1])
X = onehotencoder.fit_transform(X).toarray()
X = X[:, 1:]

I read that ColumnTransformer combined with OneHotEncoder is now used to create dummy variables, but I am having difficulty figuring it out. I did import the necessary packages. I tried this, but it still does not work:

columnTransformer = ColumnTransformer([('encoder', OneHotEncoder(), [1])], remainder='passthrough')
X = columnTransformer.fit_transform(X)

Can someone help? Thanks. I just want to one-hot encode the three countries at the start of X after they are label encoded.

Upvotes: 2

Views: 5274

Answers (1)

Vatsal Gupta

Reputation: 511

The easiest way to get dummy variables is the pandas get_dummies function. With it, you don't even need to label encode your data first.

import pandas as pd
df_country = pd.get_dummies(X[:, 1])
df_gender = pd.get_dummies(X[:, 2])

Now you have two DataFrames, one-hot encoded from the country and gender columns. Because X is a NumPy array at this point (it came from .values), wrap it in a DataFrame, drop the original country and gender columns, and concatenate the dummies:

X = pd.concat([pd.DataFrame(X).drop(columns=[1, 2]), df_country, df_gender], axis=1)
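As a rough end-to-end sketch of the same idea, you can also keep the data as a DataFrame and let get_dummies encode the columns by name. The small dataset and the column names below (CreditScore, Geography, Gender, Age) are made up for illustration, since the question does not show the real ones; drop_first=True is my assumption about what the X[:, 1:] step in the question was meant to do.

```python
import pandas as pd

# Hypothetical stand-in for the spreadsheet; the real column names
# are not shown in the question, so these are assumptions.
dataset = pd.DataFrame({
    "CreditScore": [619, 608, 502, 699],
    "Geography": ["France", "Spain", "France", "Germany"],
    "Gender": ["Female", "Female", "Male", "Male"],
    "Age": [42, 41, 42, 39],
})

# Keep X as a DataFrame (no .values yet) so columns can be encoded by name.
X_df = dataset.copy()

# One-hot encode both categorical columns in one call.
# drop_first=True drops one dummy per column, similar in spirit to
# the X[:, 1:] step in the question.
X_encoded = pd.get_dummies(
    X_df, columns=["Geography", "Gender"], drop_first=True, dtype=int
)

# Convert to a NumPy matrix only at the very end.
X = X_encoded.to_numpy()

print(X_encoded.columns.tolist())
print(X.shape)
```

No LabelEncoder step is needed here, because get_dummies works directly on the string values.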

Upvotes: 4
