WalksB

Reputation: 519

OneHotEncoder - Predefined categories for SOME columns?

Let's say I have this dataframe:

df = pd.DataFrame({"a": [1,2,3], "b": ["d", "d", "d"]})

I want to one-hot encode both the "a" and "b" columns. I know the categories of the "a" column in advance: {1, 2, 3, 4, 5}. I don't know the categories of the "b" column and want them to be inferred automatically.

How can I use the default categories='auto' behavior for only the "b" feature, but pass the categories for the "a" feature? It looks like OneHotEncoder doesn't allow that: either you pass 'auto' for all features or predefined categories for ALL features.

I would like to keep the fitted encoder for future transforms, along with the ability to handle unknown/unseen categories the way scikit-learn's OneHotEncoder does.

I tried passing categories=[[1,2,3,4,5], 'auto'], categories=[[1,2,3,4,5], None], and categories=[[1,2,3,4,5], []], but all of them raised an error.


Function snippet

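import pandas as pd
from sklearn.preprocessing import OneHotEncoder

def one_hot_encode_categorical_columns(df, columns, categories="auto"):
    # sparse_output replaces the `sparse` argument from scikit-learn 1.2 onward
    ohe = OneHotEncoder(categories=categories, sparse_output=False, handle_unknown="ignore")
    # Reuse df's index so the concat below aligns rows correctly
    ohe_df = pd.DataFrame(ohe.fit_transform(df[columns]), index=df.index)
    ohe_df.columns = ohe.get_feature_names_out(columns)
    new_df = pd.concat([df, ohe_df], axis=1)
    return ohe, new_df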

df = pd.DataFrame({"a": [1,2,3], "b": ["d", "d", "d"]})

# call function here

Upvotes: 1

Views: 141

Answers (2)

rhug123

Reputation: 8768

Using pd.CategoricalDtype() and passing in the known values [1, 2, 3, 4, 5] should work:

c = pd.CategoricalDtype(categories=[1,2,3,4,5])

pd.get_dummies(df.astype({'a':c})).astype(int)

Output:

   a_1  a_2  a_3  a_4  a_5  b_d
0    1    0    0    0    0    1
1    0    1    0    0    0    1
2    0    0    1    0    0    1
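
Since the question also asks about reusing the encoding on later data, here is a minimal sketch of one way that might be done with this approach: keep the training columns and reindex new frames against them. The frame `new` and the names `train` and `new_enc` are made up for illustration, and dropping unseen "b" values as all-zero rows is an assumption about the desired behavior, not something pandas does on its own.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["d", "d", "d"]})
c = pd.CategoricalDtype(categories=[1, 2, 3, 4, 5])

# Encode the training frame; the resulting columns act as the "fitted" layout.
train = pd.get_dummies(df.astype({"a": c})).astype(int)

# Hypothetical later data: "a" holds a known-but-unseen value, "b" an unknown one.
new = pd.DataFrame({"a": [4], "b": ["e"]})

# Reindex to the training columns: unseen columns (b_e) are dropped,
# missing ones (b_d) are filled with 0.
new_enc = (
    pd.get_dummies(new.astype({"a": c}))
    .astype(int)
    .reindex(columns=train.columns, fill_value=0)
)
print(new_enc)
```

Note that, unlike a fitted OneHotEncoder, nothing here is a reusable object; the dtype `c` and `train.columns` have to be stored and passed around by hand.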

Upvotes: 0

mozway

Reputation: 260430

What about using pure pandas here?

categories = {'a': [1, 2, 3, 4, 5]}

def dummies(s):
    out = pd.get_dummies(s)
    if s.name in categories:
        return out.reindex(columns=categories[s.name], fill_value=0)
    return out

out = pd.concat([dummies(df[x]).add_prefix(f'{x}_') for x in df], axis=1)

Output:

   a_1  a_2  a_3  a_4  a_5  b_d
0    1    0    0    0    0    1
1    0    1    0    0    0    1
2    0    0    1    0    0    1

With original:

df.join(out)

   a  b  a_1  a_2  a_3  a_4  a_5  b_d
0  1  d    1    0    0    0    0    1
1  2  d    0    1    0    0    0    1
2  3  d    0    0    1    0    0    1

Upvotes: 2
