Reputation: 21
I begin by selecting X from the Excel dataset and converting it to a NumPy array:
X = dataset.iloc[:, 3:13].values
X has two columns that I need to label encode: country and gender. There are three countries (Spain, France, and Germany) and two genders. I label encode them:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1]) # the three countries
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])
Okay, now I need to create dummy variables for the three countries, since they are not ordinal (no country ranks higher than another). However, the code I used to use for this no longer works:
onehotencoder = OneHotEncoder(categorical_features = [1])
X = onehotencoder.fit_transform(X).toarray()
X = X[:, 1:]
This code does not work. I read that ColumnTransformer combined with OneHotEncoder is now used to create dummy variables, but I am having difficulty figuring it out. I did import the necessary packages. I tried this, but it still does not work:
columnTransformer = ColumnTransformer([('encoder', OneHotEncoder(), [1])], remainder='passthrough')
X = columnTransformer.fit_transform(X)
Can someone help? Thanks. I just want to one-hot encode the three countries after they have been label encoded.
Upvotes: 2
Views: 5274
Reputation: 511
The easiest way to get dummies is the pandas get_dummies function. With it, you don't even need to label encode your data first.
import pandas as pd
df_country = pd.get_dummies(X[:, 1])  # one dummy column per country
df_gender = pd.get_dummies(X[:, 2])  # one dummy column per gender
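Now you have two one-hot encoded DataFrames, one for the country column and one for the gender column. pd.concat only accepts Series and DataFrame objects, so if X is still a NumPy array, convert it to a DataFrame first, drop the original country and gender columns, and then append the dummies:
X = pd.concat([pd.DataFrame(X).drop(columns=[1, 2]), df_country, df_gender], axis=1)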
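If you would rather stay with scikit-learn, here is a minimal sketch of the ColumnTransformer route from the question. The toy array and its values are made up for illustration and only stand in for the real dataset. It assumes a scikit-learn version where OneHotEncoder accepts string categories directly and supports the drop parameter, so the LabelEncoder step is skipped. I have not checked this against the original data, so treat it as a starting point.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for X: score, country, gender, age (made-up values).
X = np.array([
    [619, "France", "Female", 42],
    [608, "Spain", "Female", 41],
    [502, "Germany", "Male", 39],
    [699, "France", "Male", 44],
], dtype=object)

# One-hot encode column 1 (country) and column 2 (gender); pass the rest through.
# drop="first" removes one dummy per feature, similar to the X = X[:, 1:] step.
ct = ColumnTransformer(
    [("encoder", OneHotEncoder(drop="first"), [1, 2])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array instead of a sparse matrix
)

# The encoded columns come first, followed by the passthrough columns.
X_encoded = np.array(ct.fit_transform(X), dtype=float)
print(X_encoded)
```

Note that the column order changes: the dummy columns are placed at the front and the untouched columns follow, so any later indexing into X has to account for that.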
Upvotes: 4