Hot-Encoding only on some elements of a column

Question

In my dataset, many columns mix categorical and numerical values. When a numerical value was not available, a code such as 'M' or 'C' was assigned, indicating the reason it was missing.
These codes have special meaning and peculiar behavior, so I want to treat them as categorical and keep the rest as numeric. Minimal example:

import numpy as np
import pandas as pd

# Original df
ex1 = ['a', 'b', '0', '1', '2']
df = pd.DataFrame(ex1, columns=['CName'])
print(df)

  CName
0     a
1     b
2     0
3     1
4     2

## What I want to achieve
df['CName_a'] = (df.CName == 'a').astype(int)
df['CName_b'] = (df.CName == 'b').astype(int)
# Rows holding a missing-reason code get NaN in the numeric column
ff = (df.CName == 'b') | (df.CName == 'a')
df['CName_num'] = np.where(ff, np.nan, df.CName)
df2 = df.drop('CName', axis=1)
print(df2)

   CName_a  CName_b CName_num
0        1        0       NaN
1        0        1       NaN
2        0        0         0
3        0        0         1
4        0        0         2

Question 1.
Q1: How can this be done efficiently? Ideally I need to chain it in a Pipeline, as some kind of fit_transform step. Do I have to write it from scratch, or is there a trick in common libraries to one-hot encode only a subset of a column's values, like ['a', 'b', 'else']?
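
For Q1, one possible direction is a small custom scikit-learn transformer. The sketch below is only an illustration under stated assumptions: the class name PartialOneHot and the "_num" suffix are hypothetical, and it assumes that any value which cannot be parsed as a number is a missing-reason code. The codes are learned in fit, so the same columns are produced at transform time.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class PartialOneHot(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one-hot encode the non-numeric codes of
    each column and keep the remaining values as a numeric column."""

    def fit(self, X, y=None):
        X = pd.DataFrame(X)
        self.codes_ = {}
        for col in X.columns:
            numeric = pd.to_numeric(X[col], errors="coerce")
            # Assumption: present but non-numeric values are the codes.
            is_code = numeric.isna() & X[col].notna()
            self.codes_[col] = sorted(X.loc[is_code, col].unique())
        return self

    def transform(self, X):
        X = pd.DataFrame(X)
        out = {}
        for col, codes in self.codes_.items():
            # One indicator column per code seen during fit.
            for code in codes:
                out[f"{col}_{code}"] = (X[col] == code).astype(int)
            # Codes (and anything else non-numeric) become NaN here.
            out[f"{col}_num"] = pd.to_numeric(X[col], errors="coerce")
        return pd.DataFrame(out, index=X.index)


df = pd.DataFrame({"CName": ["a", "b", "0", "1", "2"]})
pipe = Pipeline([("encode", PartialOneHot())])
encoded = pipe.fit_transform(df)
print(encoded)
```

A code that appears only at transform time gets no indicator column in this sketch and simply ends up as NaN in the numeric column. Further steps (imputation, a model) could be appended to the same Pipeline.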

Question 2.
Q2: How should I fill the NaN values in CName_num? The categorical elements ('a' and 'b' in the example) behave differently from the average of the numerical values (actually from any of them). I feel that assigning 0 or the mean is not the right choice, but I have run out of options. I plan to use Random Forest, a DNN, or even regression-like training if it performs decently.
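
For Q2, one thing to consider is that the indicator columns already tell the model which rows carried a code, so the fill value mostly serves as a placeholder. The sketch below shows two common placeholders, assuming an encoded frame shaped like the desired output above: the median of the observed values, and a sentinel just below the observed range. Which one works better depends on the model and the data, so this is not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical encoded frame, shaped like the desired output in the question.
df2 = pd.DataFrame({
    "CName_a": [1, 0, 0, 0, 0],
    "CName_b": [0, 1, 0, 0, 0],
    "CName_num": [np.nan, np.nan, 0.0, 1.0, 2.0],
})

# Option 1: fill with the median of the observed numeric values.
# The indicator columns still mark which rows were originally codes.
median = df2["CName_num"].median()
filled_median = df2.assign(CName_num=df2["CName_num"].fillna(median))

# Option 2: fill with a sentinel outside the observed range, which a
# tree-based model can isolate with a single split.
sentinel = df2["CName_num"].min() - 1
filled_sentinel = df2.assign(CName_num=df2["CName_num"].fillna(sentinel))

print(filled_median)
print(filled_sentinel)
```

In an unregularized linear model that includes the indicator columns, the coefficient of each indicator can absorb whatever constant was filled in, so the choice matters less there. For distance-sensitive or scaled inputs such as a DNN, a sentinel far from the data may distort scaling, so a central value like the median is usually the gentler choice.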

