nvergos

Reputation: 432

Hash-encode multiple columns containing "same" high-cardinality information in Pandas

I have a dataset which includes numerical features and 25 very-high-cardinality categorical features, and I need to encode it in a meaningful way so it can be used to train predictive algorithms. My issue is that all 25 columns essentially correspond to the same concept, so ideally they should all be encoded as a group. Let me explain. The Pandas dataframe looks like this:

 memberid      code1   code2   code3 ...  code25      cost
 memberA       c1      c2      c4         c3          100.0
 memberB       c2      c3      c1         NaN         120.0
 memberC       c1      c2      c5         c3          200.0

This is generated by this code (only 4 "code" columns here):

import numpy as np
import pandas as pd

data = {'memberid': ['memberA', 'memberB', 'memberC'], 
        'code1': ['c1', 'c2', 'c1'],
        'code2': ['c2', 'c3', 'c2'],
        'code3': ['c4', 'c1', 'c5'],
        'code25': ['c3', np.nan, 'c3'],
        'cost': [100.0, 120.0, 200.0]}

df = pd.DataFrame(data, columns=['memberid', 'code1', 'code2', 'code3', 'code25', 'cost'])

I found a way to one-hot encode the "code" columns together, i.e., create a dataframe that looks like this:

has_c1   has_c2   has_c3   has_c4   has_c5      
1        1        1        1        0  
1        1        1        0        0
1        1        1        0        1

My problem is that all "code" columns take values of very high cardinality, so one-hot encoding like I just described would blow up the dimensions of my data by adding another ~15,000 (sparse) columns to the dataset. Unfortunately this is prohibitive from a memory standpoint for fitting ML algorithms, so I thought of looking into hashing encoding for this issue.

Although I was able to manually one-hot encode the "code" columns using numpy and ones/zeros, I don't know how I would "group" the information from all of the "code" columns into, say, 50 columns holding the components of a hashed encoding. Is this doable? Or should I follow an entirely different approach to encoding this high-cardinality "group" of features together?
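For reference, something along the lines of scikit-learn's FeatureHasher is what I have in mind (the 50-column target and the code_hash_* names here are purely illustrative), though I'm not sure this is the right approach:

    import pandas as pd
    from sklearn.feature_extraction import FeatureHasher

    # Collapse each member's codes into one list, dropping NaNs
    code_cols = df.filter(like='code')
    code_lists = code_cols.apply(lambda row: [c for c in row if pd.notna(c)], axis=1)

    # Hash every member's set of codes into a fixed number of columns
    hasher = FeatureHasher(n_features=50, input_type='string',
                           alternate_sign=False)  # keep hashed features as non-negative counts
    hashed = hasher.transform(code_lists)         # scipy sparse matrix, shape (n_members, 50)

    hashed_df = pd.DataFrame(hashed.toarray(),
                             index=df.index,
                             columns=[f'code_hash_{i}' for i in range(50)])

    df_encoded = pd.concat([df[['memberid', 'cost']], hashed_df], axis=1)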

Upvotes: 1

Views: 778

Answers (1)

BENY

Reputation: 323276

Try get_dummies, then sum the resulting duplicate columns:

output = pd.get_dummies(df.filter(like='code'), prefix='Has').sum(level=0,axis=1)

output
Out[549]: 
   Has_c1  Has_c2  Has_c3  Has_c4  Has_c5
0       1       1       1       1       0
1       1       1       1       0       0
2       1       1       1       0       1
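Note: the DataFrame.sum(level=..., axis=...) form used above has been removed in newer pandas releases. Assuming pandas 2.x, an equivalent that sums the duplicate dummy columns via groupby should be:

    # Same prefix for every "code" column, so identical codes produce duplicate column labels
    dummies = pd.get_dummies(df.filter(like='code'), prefix='Has')

    # Group the duplicate column labels and sum them (transpose to group on the index)
    output = dummies.T.groupby(level=0).sum().T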

Upvotes: 1
