Reputation: 81
I am using sklearn pipelines to perform one-hot encoding:
from sklearn.compose import make_column_transformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from xgboost import XGBClassifier

# numeric_cols, X_train and y_train are defined elsewhere in my script
preprocess = make_column_transformer(
    (MinMaxScaler(), numeric_cols),
    (OneHotEncoder(), ['country'])
)

param_grid = {
    'xgbclassifier__learning_rate': [0.01, 0.005, 0.001],
}

model = make_pipeline(preprocess, XGBClassifier())

# Initialize grid search model
model = GridSearchCV(model, param_grid=param_grid, scoring='roc_auc',
                     verbose=1, iid=True,
                     refit=True, cv=3)
model.fit(X_train, y_train)
Then, to see how the countries are one-hot encoded (I know there are two), I run:
pd.DataFrame(preprocess.fit_transform(X_test))
The result of this is:
A few questions:
(1) How does the one-hot encoding of the country column actually work here?
Upvotes: 1
Views: 56
Reputation: 25199
To help you better understand (1), i.e. how the OneHotEncoder (OHE) works, here is a small example.
Suppose you have 1 column with categorical data:
import pandas as pd

df = pd.DataFrame({"categorical": ["a", "b", "a"]})
print(df)
categorical
0 a
1 b
2 a
Then you'll get exactly one 1 per row (this is always true when you encode a single categorical column), but not necessarily one 1 per column:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
ohe.fit(df)
ohe_out = ohe.transform(df).todense()
ohe_df = pd.DataFrame(ohe_out, columns=ohe.get_feature_names(["categorical"]))
print(ohe_df)
categorical_a categorical_b
0 1.0 0.0
1 0.0 1.0
2 1.0 0.0
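If you also want to know which category each indicator column stands for, the fitted encoder exposes this directly through its categories_ attribute. A minimal sketch, repeating the example above:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"categorical": ["a", "b", "a"]})
ohe = OneHotEncoder().fit(df)

# categories_ holds, per input column, the categories in the order
# in which their indicator columns appear in the transformed output.
print(ohe.categories_)  # [array(['a', 'b'], dtype=object)]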
If you add more columns, e.g. a numerical one, the encoder handles each column separately: you still get exactly one 1 per original column, but no longer just a single 1 in the whole row:
df = pd.DataFrame({"categorical":["a","b","a"],"nums":[0,1,0]})
print(df)
categorical nums
0 a 0
1 b 1
2 a 0
ohe.fit(df)
ohe_out = ohe.transform(df).todense()
ohe_df = pd.DataFrame(ohe_out, columns=ohe.get_feature_names(["categorical", "nums"]))
print(ohe_df)
categorical_a categorical_b nums_0 nums_1
0 1.0 0.0 1.0 0.0
1 0.0 1.0 0.0 1.0
2 1.0 0.0 1.0 0.0
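Coming back to your pipeline: the same reasoning applies to the country column inside the ColumnTransformer. A minimal sketch of how you could label the transformed output, assuming make_column_transformer's default transformer naming ('onehotencoder') and the numeric_cols, X_train and X_test from your code; on newer scikit-learn versions get_feature_names is called get_feature_names_out:
import pandas as pd

preprocess.fit(X_train)

# Pull the fitted OneHotEncoder out of the ColumnTransformer by its default name.
ohe = preprocess.named_transformers_['onehotencoder']

# Output column order: the scaled numeric columns first (in the order passed),
# then one indicator column per country.
country_cols = list(ohe.get_feature_names(['country']))
all_cols = list(numeric_cols) + country_cols

out = preprocess.transform(X_test)
out = out.toarray() if hasattr(out, "toarray") else out  # densify if sparse
print(pd.DataFrame(out, columns=all_cols).head())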
Upvotes: 1