Shlomi Schwartz

Reputation: 8903

Understanding the hashing trick results

I have a DataFrame with a language column, and I'm trying to understand the result I get from HashingEncoder.

Simplified DF (the real DF has more columns):

      language country
0      English      US
1      English      US
2      English      US
3      English      AU
4      Italian      IT
5   Portuguese      JP
6      English      US
7      English      AU
8      English      US
9       German      DE
10      French      CA
11     English      UK
12     English      US
13     English      US
14     Italian      IT
15     Italian      IT
16     English      UK
17     English      US
18     English      US
19     English      AU
20     English      AU
21     English      AU
22     Italian      IT
23     English      UK
24     English      US
25      French      FR
26     English      UK
27     English      US
28     English      US
29    Japanese      AU
30     English      AU
31     English      AU
32     English      US
33     English      AU
34     English      AU
35     English      UK
36     English      AU
37     English      UK
38     English      US
39     English      US
40     English      US
41     English      AU
42     English      US
43     English      UK
44     English      AU
45     English      AU
46     English      UK
47     English      US
48     English      AU
49     English      US

I transformed the language column using the following code:

from category_encoders.hashing import HashingEncoder
ce_hash = HashingEncoder(cols = ['language'])
df2 = ce_hash.hashing_trick(df,N=2)
df2['lang'] = df['language']

The result:

   col_0  col_1        lang
0     26      7     English
1     20     13     English
2     23     10     English
3     18     15     English
4     22     11     Italian
5     19     14  Portuguese
6     20     13     English
7     19     14     English
8     19     14     English
9     19     14      German

Why do I get different values in col_0 and col_1 for the same language?

Update: When I removed other columns from my DF and left only the language column, the result looks like so:

   col_0  col_1        lang
0      1      0     English
1      1      0     English
2      1      0     English
3      1      0     English
4      0      1     Italian
5      0      1  Portuguese
6      1      0     English
7      1      0     English
8      1      0     English
9      1      0      German

My guess is that the hashing_trick method is using information from the other columns of the DF.

Q: How can I use the hashing trick to encode the language category with as few collisions as possible?

Upvotes: 1

Views: 217

Answers (1)

Dani Mesejo

Reputation: 61910

According to Wikipedia, the hashing trick:

turns arbitrary features into indices in a vector or matrix

Here N is the output dimension (the number of indices in the vector mentioned above), so to reduce collisions, increase N and pass cols explicitly so that only the language column is hashed. For example:

df2 = ce_hash.hashing_trick(df, N=6, cols=['language'])
df2['lang'] = df['language']
print(df2)

Output

   col_0  col_1  col_2  col_3  col_4  col_5        lang
0      0      0      0      0      1      0     English
1      0      0      0      0      1      0     English
2      0      0      0      0      1      0     English
3      0      0      0      0      1      0     English
4      0      0      0      1      0      0     Italian
5      0      1      0      0      0      0  Portuguese
6      0      0      0      0      1      0     English
7      0      0      0      0      1      0     English
8      0      0      0      0      1      0     English
9      1      0      0      0      0      0      German
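To see how the choice of N affects collisions, here is a minimal sketch that groups category values by bucket. It only illustrates the idea with the standard library: the MD5-then-modulo scheme is an assumption about how the encoder hashes, and bucket_of and collisions are hypothetical helper names, not part of category_encoders.

```python
import hashlib
from collections import defaultdict


def bucket_of(value, n_buckets):
    """Map a value to a bucket index: MD5 digest modulo n_buckets (assumed scheme)."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def collisions(categories, n_buckets):
    """Return {bucket: [categories]} for buckets shared by more than one category."""
    buckets = defaultdict(list)
    for cat in sorted(set(categories)):
        buckets[bucket_of(cat, n_buckets)].append(cat)
    return {b: cats for b, cats in buckets.items() if len(cats) > 1}


languages = ["English", "Italian", "Portuguese", "German", "French", "Japanese"]

# With fewer buckets than categories, collisions are unavoidable.
for n in (2, 6, 8, 64):
    print(n, collisions(languages, n))
```

A larger N makes collisions less likely but does not rule them out, so checking the known categories like this before settling on N can be worthwhile.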

As for which input columns are hashed, the source code of hashing_trick contains:

if cols is None:
    cols = X.columns.values

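In other words, when cols is not given, every column of the input df is hashed, and each value adds one count to a bucket. That is consistent with your first output, where col_0 and col_1 sum to 33 in every row shown. The hashing_trick function does not use any information from the calling object, so the cols passed to the constructor are ignored.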
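The effect of hashing a whole row can be reproduced with a small standard-library sketch. This is not the library's code: hash_row is a hypothetical helper, and the MD5-then-modulo scheme and the sample row values are assumptions made for illustration.

```python
import hashlib


def hash_row(values, n_buckets):
    """Count how many of the row's values land in each of n_buckets buckets."""
    counts = [0] * n_buckets
    for value in values:
        if value is None:
            continue  # missing values contribute nothing
        digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
        counts[int(digest, 16) % n_buckets] += 1
    return counts


# A made-up row with three columns: language, country, and some other field.
row = ["English", "US", 42]

all_cols = hash_row(row, 2)       # every column hashed: counts sum to 3
lang_only = hash_row(row[:1], 2)  # only the language hashed: a single 1

print(all_cols, lang_only)
```

Hashing every column mixes the language with unrelated values, so two rows with the same language can get different counts. Restricting the input to the language value, which is what cols=['language'] does, gives one bucket per language.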

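Finally, if you call fit_transform instead, the encoder uses the cols given to its constructor and its n_components setting for the number of output dimensions (8 by default, which matches the eight columns below):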

df2 = ce_hash.fit_transform(df)
df2['lang'] = df['language']
print(df2)

Output

   col_0  col_1  col_2  col_3  col_4  col_5  col_6  col_7        lang
0      0      0      0      0      1      0      0      0     English
1      0      0      0      0      1      0      0      0     English
2      0      0      0      0      1      0      0      0     English
3      0      0      0      0      1      0      0      0     English
4      0      0      0      0      0      1      0      0     Italian
5      0      1      0      0      0      0      0      0  Portuguese
6      0      0      0      0      1      0      0      0     English
7      0      0      0      0      1      0      0      0     English
8      0      0      0      0      1      0      0      0     English
9      0      0      0      0      0      0      1      0      German

Upvotes: 1
