scipio1465

Reputation: 23

One-hot encoding only some elements of a column

In my dataset I have many columns with mixed categorical and numerical values. Basically, when the numerical value was not available, a code such as 'M' or 'C' was assigned to indicate the reason it was missing.
These codes have special meaning and peculiar behavior, so I want to treat them as categorical and keep the rest as numeric. Minimal example:

import numpy as np
import pandas as pd

# Original df
ex1 = ['a', 'b', '0', '1', '2']
df = pd.DataFrame(ex1, columns=['CName'])
print(df)

  CName
0     a
1     b
2     0
3     1
4     2

## What I want to achieve
df['CName_a'] = (df.CName == 'a').astype(int)
df['CName_b'] = (df.CName == 'b').astype(int)
ff = (df.CName == 'b') | (df.CName == 'a')
df['CName_num'] = np.where(ff, np.nan, df.CName)
df2 = df.drop('CName', axis=1)
print(df2)

   CName_a  CName_b CName_num
0        1        0       NaN
1        0        1       NaN
2        0        0         0
3        0        0         1
4        0        0         2

Question 1.
Q1: How can this be done efficiently? Ideally I need to chain it in a Pipeline, as some fit_transform kind of thing. Do I have to write it from scratch, or is there a hack in common libraries to one-hot encode a subset of a column, like ['a', 'b', 'else']?

Question 2.
Q2: How should I fill the NaN values in CName_num? The categorical elements ('a' and 'b' in the example) behave differently from the average of the numerical values (actually from any of the numerical values). I feel that assigning 0 or the mean is not the right choice, but I have run out of options. I plan to use Random Forest, DNN, or even regression-like training if it performs decently.

Upvotes: 1

Views: 78

Answers (1)

Chris Adams

Reputation: 18647

Here is one potential solution. First create a boolean mask using str.isdigit. Use pandas.get_dummies and pandas.concat for your final DataFrame:

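# True where the value is a string of digits, False for codes like 'a' and 'b'
mask = df['CName'].str.isdigit()

# dtype=float keeps the 1.0/0.0 output on pandas versions where get_dummies returns booleans
pd.concat([pd.get_dummies(df.loc[~mask, 'CName'], prefix='CName', dtype=float)
             .reindex(df.index).fillna(0),
           df.loc[mask].add_suffix('_num')], axis=1)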

[out]

   CName_a  CName_b CName_num
0      1.0      0.0       NaN
1      0.0      1.0       NaN
2      0.0      0.0         0
3      0.0      0.0         1
4      0.0      0.0         2
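
For the Pipeline part of Q1, the same split can be wrapped in a small object with fit and transform methods. The following is only a rough sketch, not a library feature: the class name PartialOneHot and its column and codes parameters are made up for illustration. It uses pd.to_numeric with errors='coerce' instead of str.isdigit, on the assumption that the numeric values may include negatives or decimals. To use it inside an actual scikit-learn Pipeline, it would presumably need to inherit from sklearn.base.BaseEstimator and TransformerMixin, which is left out here to keep the example pandas-only.

```python
import pandas as pd


class PartialOneHot:
    """Hypothetical helper: split one mixed column into indicator
    columns for the non-numeric codes plus a numeric column."""

    def __init__(self, column, codes=None):
        self.column = column
        self.codes = codes  # e.g. ['a', 'b']; learned in fit() when None

    def fit(self, X, y=None):
        values = X[self.column]
        if self.codes is None:
            # Every non-null value that does not parse as a number is a code.
            is_code = pd.to_numeric(values, errors='coerce').isna() & values.notna()
            self.codes_ = sorted(values[is_code].unique())
        else:
            self.codes_ = list(self.codes)
        return self

    def transform(self, X):
        values = X[self.column]
        out = X.drop(columns=[self.column])
        # One 0/1 indicator column per code seen during fit.
        for code in self.codes_:
            out[f'{self.column}_{code}'] = (values == code).astype(int)
        # Codes (and anything else non-numeric) become NaN here.
        out[f'{self.column}_num'] = pd.to_numeric(values, errors='coerce')
        return out

    def fit_transform(self, X, y=None):
        return self.fit(X).transform(X)


df = pd.DataFrame({'CName': ['a', 'b', '0', '1', '2']})
result = PartialOneHot('CName').fit_transform(df)
print(result)
```

Because the codes are stored in fit, transform produces the same columns on new data; a code that was not seen during fit ends up as NaN in CName_num with all indicators at 0. On Q2, one option worth trying is to fill CName_num with a constant and rely on the indicator columns to tell the model which rows were codes, though whether that works well will depend on the model and the data.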

Upvotes: 1
