Reputation: 519
Let's say I have this dataframe:
df = pd.DataFrame({"a": [1,2,3], "b": ["d", "d", "d"]})
I want to one-hot encode both the "a" and "b" columns. I know the categories of the "a" column are {1, 2, 3, 4, 5}, but I don't know the categories of the "b" column and want them to be inferred automatically.
How can I use the default categories='auto' behavior for only the "b" feature, but pass the categories for the "a" feature? It looks like OneHotEncoder doesn't allow that: either you pass 'auto' for all features or predefined categories for all features.
I would like to keep the encoder for future transforms, along with the ability to handle unknown/unseen categories the way sklearn's OneHotEncoder does.
I tried passing categories=[[1,2,3,4,5], 'auto'], categories=[[1,2,3,4,5], None], and categories=[[1,2,3,4,5], []], but all of them errored out.
Function snippet:
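import pandas as pd
from sklearn.preprocessing import OneHotEncoder

def one_hot_encode_categorical_columns(df, columns, categories="auto"):
    # sparse_output replaces the sparse argument in scikit-learn >= 1.2
    ohe = OneHotEncoder(categories=categories, sparse_output=False, handle_unknown="ignore")
    # Keep df's index so the concat below aligns row by row
    ohe_df = pd.DataFrame(ohe.fit_transform(df[columns]), index=df.index)
    ohe_df.columns = ohe.get_feature_names_out(columns)
    new_df = pd.concat([df, ohe_df], axis=1)
    return ohe, new_df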
df = pd.DataFrame({"a": [1,2,3], "b": ["d", "d", "d"]})
# call function here
Upvotes: 1
Views: 141
Reputation: 8768
Using pd.CategoricalDtype() and passing in the known values [1,2,3,4,5] should work:
c = pd.CategoricalDtype(categories=[1,2,3,4,5])
pd.get_dummies(df.astype({'a':c})).astype(int)
Output:
   a_1  a_2  a_3  a_4  a_5  b_d
0    1    0    0    0    0    1
1    0    1    0    0    0    1
2    0    0    1    0    0    1
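The question also asks for something reusable on future data. One way to extend the CategoricalDtype idea, sketched here under assumptions, is to store one dtype per column and apply the same dtypes to any later frame. The names dtypes, encode, and new_df are hypothetical, and the categories of "b" are simply taken from the training frame. Values outside the stored categories become NaN on the cast and encode as all-zero rows, which resembles handle_unknown="ignore".

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["d", "d", "d"]})

# Known categories for "a"; categories for "b" are inferred from the training data.
# (dtypes is a hypothetical name, not part of the original answer.)
dtypes = {
    "a": pd.CategoricalDtype(categories=[1, 2, 3, 4, 5]),
    "b": pd.CategoricalDtype(categories=sorted(df["b"].unique())),
}

def encode(frame):
    # Casting to the stored dtypes turns unseen values into NaN,
    # which get_dummies then encodes as a row of zeros.
    return pd.get_dummies(frame.astype(dtypes), dtype=int)

train_encoded = encode(df)

# A later frame with unseen values (9 in "a", "z" in "b") gets the same columns.
new_df = pd.DataFrame({"a": [5, 9], "b": ["d", "z"]})
new_encoded = encode(new_df)
```

The stored dtypes dictionary plays the role of the fitted encoder here: it has to be kept around and reused for every later frame.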
Upvotes: 0
Reputation: 260430
What about using pure pandas here?
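categories = {'a': [1, 2, 3, 4, 5]}

def dummies(s):
    # dtype=int gives 0/1 rather than booleans (the default in pandas >= 2.0)
    out = pd.get_dummies(s, dtype=int)
    if s.name in categories:
        # add all-zero columns for known categories absent from the data
        return out.reindex(columns=categories[s.name], fill_value=0)
    return out

out = pd.concat([dummies(df[x]).add_prefix(f'{x}_') for x in df], axis=1)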
Output:
   a_1  a_2  a_3  a_4  a_5  b_d
0    1    0    0    0    0    1
1    0    1    0    0    0    1
2    0    0    1    0    0    1
With original:
df.join(out)
   a  b  a_1  a_2  a_3  a_4  a_5  b_d
0  1  d    1    0    0    0    0    1
1  2  d    0    1    0    0    0    1
2  3  d    0    0    1    0    0    1
Upvotes: 2