Reputation: 95
I work on a dataset which contain mainly binary variables. However two of the are categorical with multiple values (strings). I want to apply feature selection using lasso but i have an error Keyerror: could not convert string to float:
Should i use LabelEncoder
and then do the feature selection? Any ideas how to deal with this?
Here is my code
X = data.iloc[:,:-1]
y = data.iloc[:,-1]
scaler = MinMaxScaler()
scaler.fit(X)
X_scaled = scaler.transform()
selector = SelectFromModel(estimator=LassoCV (cv=5)).fit(X_scaled,y)
selector.get_support()
Upvotes: 1
Views: 1701
Reputation: 46898
It is problematic to use onehot because each category will be coded as binary and feeding it into lasso doesn't allow selection of the categorical variable as a whole, which is what you are after i guess. You can also check out this post.
You can use the group lasso implementation in python. Below I use an example dataset:
import pandas as pd
import numpy as np
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from group_lasso import GroupLasso
from group_lasso.utils import extract_ohe_groups
import scipy.sparse
data = pd.DataFrame({'cat1':np.random.choice(['A','B','C'],100),
'cat2':np.random.choice(['D','E','F'],100),
'bin1':np.random.choice([0,1],100),
'bin2':np.random.choice([0,1],100)})
data['y'] = 1.5*data['bin1'] + -3*data['bin2'] + 2*(data['cat1'] == 'A').astype('int') + np.random.normal(0,1,100)
Define the categorical and numeric (binary) columns. You don't need the min max scaler since your values are binary. Next we onehot encode the categorical columns and extract the groups out:
cat_columns = ['cat1','cat2']
num_columns = ['bin1','bin2']
ohe = OneHotEncoder()
onehot_data = ohe.fit_transform(data[cat_columns])
groups = extract_ohe_groups(ohe)
Put numeric and onehot together, you can also convert them to dense, but can be problematic if data is huge:
X = scipy.sparse.hstack([onehot_data,scipy.sparse.csr_matrix(data[num_columns])])
y = data['y']
Likewise, construct the groups:
groups = np.hstack([groups,len(cat_columns) + np.arange(len(num_columns))+1])
groups
Run the group lasso:
grpLasso = GroupLasso(groups=groups,supress_warning=True,n_iter=1000)
grpLasso.sparsity_mask_
array([ True, True, True, False, False, False, True, True])
grpLasso.chosen_groups_
{0, 3, 4}
Check out also the help page for using it in a pipeline.
Upvotes: 1