Reputation: 31
I have a cleaned housing dataset with about 75 features and 1 target variable. To use lasso regression for selecting the most relevant of the 75 features, I have so far relied on label encoding for the categorical features, since it preserves column identity, as follows:
# Label-encode all other categorical features:
for x in categorical_features:
    # Order the labels by mean SalePrice (the target variable), then map each label to an integer
    labels_ordered = house_df.groupby([x])['SalePrice'].mean().sort_values().index
    labels_ordered = {k: i for i, k in enumerate(labels_ordered, 0)}
    house_df[x] = house_df[x].map(labels_ordered)
# After splitting into train/test and fitting the lasso:
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

feature_sel_model = SelectFromModel(Lasso(alpha=0.005, random_state=0))
feature_sel_model.fit(X_train, y_train)
# Checking the array of selected and rejected features
feature_sel_model.get_support()
O/P: array([ True, True, False, False, False, False, False, False, False,
False, True, False, False, False, False, True, True, False,
True, False, False, False, False, False, False, False, False,
True, True, False, True, False, True, False, False, False,
True, False, True, True, False, True, False, False, True,
False, False, False, False, False, False, True, False, False,
True, False, False, False, True, True, True, False, False,
True, False, False, False, False, False, False, False, False,
False, False, True])
# Making a list of the selected features
selected_feat = X_train.columns[(feature_sel_model.get_support())]
# let's print some stats
print('total features: {}'.format((X_train.shape[1])))
print('selected features: {}'.format(len(selected_feat)))
O/P: total features: 75
selected features: 22
Column identity is needed so I can take the lasso output and remove the irrelevant features from the original dataset.
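With named columns, that removal is just an index by name, e.g.:
# Keeping only the lasso-selected columns (selected_feat from above)
X_train = X_train[selected_feat]
X_test = X_test[selected_feat]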
My problem is that the categorical features have multiple labels and are not ordinal, so one-hot encoding with sklearn would actually be the better method, but it produces a complex matrix and destroys column identity. How do I use the output of OneHotEncoder (a np.array with all the encoded variables moved to the left of the matrix) to feed the lasso regressor? Or should I stick with label encoding?
Upvotes: 0
Views: 516
Reputation: 46908
If, for example, a particular column has categories A, B, C and D, it will be expanded into 4 columns: a 0/1 indicator for A, a 0/1 indicator for B, and so on. If, after running the regression, A and B are dropped (their coefficients are 0), it means that being A or B carries no useful information for the final model, while being C or D does.
If we then refit the model using only the binary columns for C and D, this works perfectly well, because samples whose category is A or B are simply encoded as not-C and not-D.
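A minimal sketch of that idea on toy data (hypothetical names, assuming the lasso dropped the A and B indicators):
import pandas as pd
from sklearn.linear_model import Lasso

df = pd.DataFrame({'cat': ['A', 'B', 'C', 'D', 'C', 'A'],
                   'y':   [1.0, 2.0, 3.5, 4.1, 3.3, 0.9]})
X = pd.get_dummies(df[['cat']])   # columns cat_A, cat_B, cat_C, cat_D
X_kept = X[['cat_C', 'cat_D']]    # pretend the lasso zeroed out cat_A, cat_B
# Rows with A or B are 0 in both kept columns ("not C and not D")
refit = Lasso(alpha=0.1).fit(X_kept, df['y'])
print(refit.coef_)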
So it depends on the aim of doing the lasso. If it is prediction, that is, selecting variables and refitting them into a linear model (or another lasso), then passing the numpy array is fine.
If instead you want to identify which features are "important", you have to look at which dummy columns were kept and infer what that means for the original variables.
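A small sketch of that mapping, assuming the usual prefix naming convention of get_dummies/OneHotEncoder (the names here are illustrative):
selected = ['A1_b', 'B']      # hypothetical kept feature names
cat_cols = ['A1', 'A2']       # original categorical columns
origin = {f: next((c for c in cat_cols if f.startswith(c + '_')), f)
          for f in selected}
print(origin)                 # {'A1_b': 'A1', 'B': 'B'}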
Upvotes: 0
Reputation: 563
First of all, you should scale your numeric features when using Lasso for feature importance, because the coefficients depend on the scale of the features (I used MinMaxScaler in my example).
One option is pandas.get_dummies(), which returns a DataFrame and keeps the column names:
# One Hot Encoding
ohe_df = pd.get_dummies(house_df, columns=list_cat_of_cols)
# split into train/test and do other stuff
...
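Because get_dummies returns a DataFrame, the selection pattern from your question keeps working on the dummy column names. A minimal sketch, assuming ohe_df from the snippet above and a numeric_cols list of the numeric columns:
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X = ohe_df.drop(columns=['SalePrice'])
y = ohe_df['SalePrice']
X[numeric_cols] = MinMaxScaler().fit_transform(X[numeric_cols])  # scale numerics for Lasso
sel = SelectFromModel(Lasso(alpha=0.005, random_state=0)).fit(X, y)
selected_feat = X.columns[sel.get_support()]   # dummy names are preserved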
Alternatively, sklearn's OneHotEncoder has a method get_feature_names(). Calling ohe.get_feature_names(cat_cols) returns the labels of the encoded categorical columns. I suggest reading the documentation for any further explanation.
Example:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.compose import ColumnTransformer

df = pd.DataFrame({'A1': ['a', 'a', 'b', 'a', 'c', 'b'],
                   'A2': ['x', 'y', 'y', 'y', 'x', 'x'],
                   'B': [1, 2, 3, 1, 5, 2],
                   'C': [1.19, 2.21, 3.51, 1.23, 5.12, 2.49]})
X = df.drop(columns=['C'])
y = df['C']
cat_cols = ['A1', 'A2']
other_cols = X.drop(columns=cat_cols).columns

# One-hot encode the categorical columns and scale the rest;
# drop='if_binary' keeps a single indicator for the binary feature A2,
# which matches the printed output below
ct = ColumnTransformer([('ohe', OneHotEncoder(drop='if_binary', sparse=False), cat_cols)],
                       remainder=MinMaxScaler())
encoded_matrix = ct.fit_transform(X)
# Recover the names of the encoded columns, then append the remainder columns
encoded_cols = ct.named_transformers_.ohe.get_feature_names(cat_cols)
all_features = np.concatenate([encoded_cols, other_cols])
print('all_features:', all_features)

feature_sel_model = SelectFromModel(Lasso(alpha=0.05))
feature_sel_model.fit(encoded_matrix, y)
feature_mask = feature_sel_model.get_support()
print('selected_features:', all_features[feature_mask])
Output:
all_features: ['A1_a' 'A1_b' 'A1_c' 'A2_y' 'B']
selected_features: ['A1_b' 'B']
You should use OneHotEncoder (rather than get_dummies) if you need to apply the same encoding to the test data. More info here: https://stackoverflow.com/a/56567037/7623492
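A short sketch of that pattern (fit the encoder on the training data only, then reuse it on test data so both get identical columns):
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)
ohe.fit(X_train[cat_cols])                      # learn categories from train only
X_train_enc = ohe.transform(X_train[cat_cols])
X_test_enc = ohe.transform(X_test[cat_cols])    # same columns as train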
Upvotes: 2