I have a DF with a language column, and I'm trying to understand the result I'm getting from HashingEncoder.
Simplified DF (the real DF has more columns):
language country
0 English US
1 English US
2 English US
3 English AU
4 Italian IT
5 Portuguese JP
6 English US
7 English AU
8 English US
9 German DE
10 French CA
11 English UK
12 English US
13 English US
14 Italian IT
15 Italian IT
16 English UK
17 English US
18 English US
19 English AU
20 English AU
21 English AU
22 Italian IT
23 English UK
24 English US
25 French FR
26 English UK
27 English US
28 English US
29 Japanese AU
30 English AU
31 English AU
32 English US
33 English AU
34 English AU
35 English UK
36 English AU
37 English UK
38 English US
39 English US
40 English US
41 English AU
42 English US
43 English UK
44 English AU
45 English AU
46 English UK
47 English US
48 English AU
49 English US
I've encoded the language column using the following code:
from category_encoders.hashing import HashingEncoder
ce_hash = HashingEncoder(cols=['language'])
df2 = ce_hash.hashing_trick(df, N=2)
df2['lang'] = df['language']
The result:
col_0 col_1 lang
0 26 7 English
1 20 13 English
2 23 10 English
3 18 15 English
4 22 11 Italian
5 19 14 Portuguese
6 20 13 English
7 19 14 English
8 19 14 English
9 19 14 German
Why do I get different values in col_0 and col_1 for the same language?
Update: When I removed the other columns from my DF and left only the language column, the result looks like this:
col_0 col_1 lang
0 1 0 English
1 1 0 English
2 1 0 English
3 1 0 English
4 0 1 Italian
5 0 1 Portuguese
6 1 0 English
7 1 0 English
8 1 0 English
9 1 0 German
My guess is that the hashing_trick method is using information from the other columns of the DF.
Q: How can I use the hashing trick to encode the language category with minimum collisions?
According to Wikipedia, the hashing trick:
turns arbitrary features into indices in a vector or matrix
Here, N is the output dimension (the number of indices in the vector mentioned above), so to minimize collisions, increase the output dimension. For example:
df2 = ce_hash.hashing_trick(df, N=6, cols=['language'])
df2['lang'] = df['language']
print(df2)
Output
col_0 col_1 col_2 col_3 col_4 col_5 lang
0 0 0 0 0 1 0 English
1 0 0 0 0 1 0 English
2 0 0 0 0 1 0 English
3 0 0 0 0 1 0 English
4 0 0 0 1 0 0 Italian
5 0 1 0 0 0 0 Portuguese
6 0 0 0 0 1 0 English
7 0 0 0 0 1 0 English
8 0 0 0 0 1 0 English
9 1 0 0 0 0 0 German
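The effect of N can be sketched in plain Python: each category string is hashed and reduced modulo N to pick a bucket. With only two buckets, six distinct languages are guaranteed to collide; with more buckets, collisions become less likely. This sketch uses md5, like the encoder's default hash_method, but it is purely illustrative, not the library's exact implementation.

```python
import hashlib

def hash_bucket(value, N):
    # Hash a category string with md5 and map it to one of N buckets.
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % N

langs = ["English", "Italian", "Portuguese", "German", "French", "Japanese"]

# N=2: six languages into two buckets, so collisions are unavoidable.
print({lang: hash_bucket(lang, 2) for lang in langs})

# N=8: more buckets, so distinct languages are less likely to share one.
print({lang: hash_bucket(lang, 8) for lang in langs})
```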
Regarding the number of input columns used for hashing, a look at the code shows:
if cols is None:
    cols = X.columns.values
i.e., by default it uses all the columns of the input df. The hashing_trick function does not use any information from the calling object.
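That default explains the behavior in the question: when every column is hashed, each row's encoding depends on every value in the row, so two rows with the same language but different countries can land in different buckets. A minimal sketch of that row-wise behavior (again with md5, purely for illustration):

```python
import hashlib

def hash_row(values, N):
    # Each value in the row increments one of N buckets, so the
    # resulting vector depends on every column that is fed in.
    buckets = [0] * N
    for value in values:
        digest = int(hashlib.md5(str(value).encode("utf-8")).hexdigest(), 16)
        buckets[digest % N] += 1
    return buckets

# Same language, different country: the whole-row encodings can differ...
print(hash_row(["English", "US"], 2))
print(hash_row(["English", "AU"], 2))

# ...while hashing the language column alone is identical for both rows.
print(hash_row(["English"], 2), hash_row(["English"], 2))
```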
Finally, you can call fit_transform, which uses the encoder's default output dimension (n_components=8):
df2 = ce_hash.fit_transform(df)
df2['lang'] = df['language']
print(df2)
Output
col_0 col_1 col_2 col_3 col_4 col_5 col_6 col_7 lang
0 0 0 0 0 1 0 0 0 English
1 0 0 0 0 1 0 0 0 English
2 0 0 0 0 1 0 0 0 English
3 0 0 0 0 1 0 0 0 English
4 0 0 0 0 0 1 0 0 Italian
5 0 1 0 0 0 0 0 0 Portuguese
6 0 0 0 0 1 0 0 0 English
7 0 0 0 0 1 0 0 0 English
8 0 0 0 0 1 0 0 0 English
9 0 0 0 0 0 0 1 0 German