Reputation: 136
I would like to hash feature ‘Genre’ into 6 columns and separately feature ‘Publisher’ into another six columns. I want something like below:
Genre Publisher 0 1 2 3 4 5 0 1 2 3 4 5
0 Platform Nintendo 0.0 2.0 2.0 -1.0 1.0 0.0 0.0 2.0 2.0 -1.0 1.0 0.0
1 Racing Noir -1.0 0.0 0.0 0.0 0.0 -1.0 -1.0 0.0 0.0 0.0 0.0 -1.0
2 Sports Laura -2.0 2.0 0.0 -2.0 0.0 0.0 -2.0 2.0 0.0 -2.0 0.0 0.0
3 Roleplaying John -2.0 2.0 2.0 0.0 1.0 0.0 -2.0 2.0 2.0 0.0 1.0 0.0
4 Puzzle John 0.0 1.0 1.0 -2.0 1.0 -1.0 0.0 1.0 1.0 -2.0 1.0 -1.0
5 Platform Noir 0.0 2.0 2.0 -1.0 1.0 0.0 0.0 2.0 2.0 -1.0 1.0 0.0
The following code does what I want to do
import pandas as pd
d = {'Genre': ['Platform', 'Racing','Sports','Roleplaying','Puzzle','Platform'], 'Publisher': ['Nintendo', 'Noir','Laura','John','John','Noir']}
df = pd.DataFrame(data=d)
from sklearn.feature_extraction import FeatureHasher
fh1 = FeatureHasher(n_features=6, input_type='string')
fh2 = FeatureHasher(n_features=6, input_type='string')
hashed_features1 = fh.fit_transform(df['Genre'])
hashed_features2 = fh.fit_transform(df['Publisher'])
hashed_features1 = hashed_features1.toarray()
hashed_features2 = hashed_features2.toarray()
pd.concat([df[['Genre', 'Publisher']], pd.DataFrame(hashed_features1),pd.DataFrame(hashed_features2)],
axis=1)
This works for the above two feature but If I have lets say 40 categorical features then this approach would be tedious. Is there any other way to do?
Upvotes: 11
Views: 5098
Reputation: 1221
Even though, I am late here, from the examples I have seen on Kaggle , FeatureHashing is performed at once for multiple columns (ie on a DataFrame) rather than for individual columns and concatenating the sparse matrices. See Notebooks on Kaggle, here and here. I have also used both ways of performing feature hashing on this data, ie:
a. Hash individual categorical columns and concatenate the results
b. Hash all categorical columns of a DataFrame at once
Logistic Regression classifier gave significantly better results when approach (b) was followed rather than approach (a).
Upvotes: 2
Reputation: 4150
Hashing (Update)
Assuming that new categories might show up in some of the features, hashing is the way to go. Just 2 notes:
One Hot Vector
In case the number of categories for each feature is fixed and not too large, use one hot encoding.
I would recommend using either of the two:
sklearn.preprocessing.OneHotEncoder
pandas.get_dummies
Example
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction import FeatureHasher
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame({'feature_1': ['A', 'G', 'T', 'A'],
'feature_2': ['cat', 'dog', 'elephant', 'zebra']})
# Approach 0 (Hashing per feature)
n_orig_features = df.shape[1]
hash_vector_size = 6
ct = ColumnTransformer([(f't_{i}', FeatureHasher(n_features=hash_vector_size,
input_type='string'), i) for i in range(n_orig_features)])
res_0 = ct.fit_transform(df) # res_0.shape[1] = n_orig_features * hash_vector_size
# Approach 1 (OHV)
res_1 = pd.get_dummies(df)
# Approach 2 (OHV)
res_2 = OneHotEncoder(sparse=False).fit_transform(df)
res_0
:
array([[ 0., 0., 0., 0., 1., 0., 0., 0., 1., -1., 0., -1.],
[ 0., 0., 0., 1., 0., 0., 0., 2., -1., 0., 0., 0.],
[ 0., -1., 0., 0., 0., 0., -2., 2., 2., -1., 0., -1.],
[ 0., 0., 0., 0., 1., 0., 0., 2., 1., -1., 0., -1.]])
res_1
:
feature_1_A feature_1_G feature_1_T feature_2_cat feature_2_dog feature_2_elephant feature_2_zebra
0 1 0 0 1 0 0 0
1 0 1 0 0 1 0 0
2 0 0 1 0 0 1 0
3 1 0 0 0 0 0 1
res_2
:
array([[1., 0., 0., 1., 0., 0., 0.],
[0., 1., 0., 0., 1., 0., 0.],
[0., 0., 1., 0., 0., 1., 0.],
[1., 0., 0., 0., 0., 0., 1.]])
Upvotes: 7