Reputation: 432
I have a dataset that includes numerical features and 25 very high-cardinality categorical features, and I need to encode it in a meaningful way so it can be used to train predictive algorithms. My issue is that all 25 columns correspond to essentially the same concept, so ideally they should all be encoded as a group. Let me explain. The Pandas dataframe looks like this:
memberid  code1  code2  code3  ...  code25  cost
memberA   c1     c2     c4     ...  c3      100.0
memberB   c2     c3     c1     ...  NaN     120.0
memberC   c1     c2     c5     ...  c3      200.0
This is generated by the following code (only 4 "code" columns shown here):
import numpy as np
import pandas as pd

data = {'memberid': ['memberA', 'memberB', 'memberC'],
        'code1': ['c1', 'c2', 'c1'],
        'code2': ['c2', 'c3', 'c2'],
        'code3': ['c4', 'c1', 'c5'],
        'code25': ['c3', np.nan, 'c3'],
        'cost': [100.0, 120.0, 200.0]}
df = pd.DataFrame(data, columns=['memberid', 'code1', 'code2', 'code3', 'code25', 'cost'])
I found a way to one-hot encode the "code" columns together, i.e., create a dataframe that looks like this:
has_c1  has_c2  has_c3  has_c4  has_c5
     1       1       1       1       0
     1       1       1       0       0
     1       1       1       0       1
My problem is that all "code" columns take values of very high cardinality, so one-hot encoding like I just described would blow up the dimensions of my data by adding another ~15,000 (sparse) columns to the dataset. Unfortunately this is prohibitive from a memory standpoint for fitting ML algorithms, so I thought of looking into hashing encoding for this issue.
However, although I was able to manually one-hot encode the "code" columns using numpy and ones/zeros, I don't know how I could "group" the information of all the "code" columns into, say, 50 columns holding the components of a hashing encoding. Is this doable? Or should I follow an entirely different approach to encoding this high-cardinality "group" of features together?
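For concreteness, this is roughly the direction I have in mind, assuming scikit-learn's FeatureHasher (the 50-column target and the hash_i column names are just illustrative; I don't know whether this is the right tool):

from sklearn.feature_extraction import FeatureHasher

# Gather each member's codes into a single list, dropping NaNs
code_lists = df.filter(like='code').apply(
    lambda row: [c for c in row if pd.notna(c)], axis=1)

# Hash each list of codes into a fixed number of columns (50 here);
# with the default alternate_sign=True, entries can be -1, 0, or +1
hasher = FeatureHasher(n_features=50, input_type='string')
hashed = hasher.transform(code_lists)  # scipy sparse matrix, shape (n_rows, 50)

hashed_df = pd.DataFrame(hashed.toarray(), index=df.index,
                         columns=[f'hash_{i}' for i in range(50)])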
Upvotes: 1
Views: 778
Reputation: 323276
Try with get_dummies, then sum the dummy columns that share the same name:
output = pd.get_dummies(df.filter(like='code'), prefix='Has').sum(level=0,axis=1)
Out[549]:
   Has_c1  Has_c2  Has_c3  Has_c4  Has_c5
0       1       1       1       1       0
1       1       1       1       0       0
2       1       1       1       0       1
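Note that DataFrame.sum(level=...) was removed in pandas 2.0. On current pandas, one equivalent spelling is to stack the code columns into a single Series first (which also drops the NaN) and then sum the dummies per row:

# Same output as above, written for current pandas versions
output = pd.get_dummies(df.filter(like='code').stack(), prefix='Has').groupby(level=0).sum()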
Upvotes: 1