Lusian

Reputation: 653

Featurizer to eliminate features

I am trying to set up a featurizer that drops all but the first 10 columns of my database, which has 76 columns in total. The idea is to apply PolynomialFeatures(1) to the 10 columns I would like to keep, but I cannot see a smart way to eliminate the remaining 66 columns. (I was thinking of something like PolynomialFeatures(0), the idea being to multiply them by the constant 0, but it does not seem to work.) There are basically two issues: 1) how to tell DataFrameMapper to apply the same featurizer to a range of columns (namely A_11 to A_76); 2) how to tell DataFrameMapper to apply a featurizer that eliminates such columns.

The (incomplete) code I have tried so far is below. In the code, A_11-A_76 marks issue 1) (i.e. the range) and ? marks issue 2):

import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from dml_iv.utilities import SubsetWrapper, ConstantModel
from econml.sklearn_extensions.linear_model import StatsModelsLinearRegression

col = ["A_"+str(k) for k in range(XW.shape[1])]
XW_db = pd.DataFrame(XW, columns=col)

from sklearn_pandas import DataFrameMapper

subset_names = set(['A_0','A_1','A_2','A_3','A_4','A_5','A_6','A_7','A_8','A_9','A_10'])
# list of indices of features X to use in the final model

mapper = DataFrameMapper([
('A_0', PolynomialFeatures(1)),
('A_1', PolynomialFeatures(1)),
('A_2', PolynomialFeatures(1)),
('A_3', PolynomialFeatures(1)),
('A_4', PolynomialFeatures(1)),
('A_5', PolynomialFeatures(1)),
('A_11 - A_66', ?)]) ## PROBLEMATIC PART

Upvotes: 1

Views: 113

Answers (1)

Max

Reputation: 13334

Why don't you drop the columns you don't want from your dataframe and map what's left?

cols_map = [...] # list of (column, transformer) pairs to map
cols_drop = [...] # list of column names to drop
XW_db = XW_db.drop(cols_drop, axis=1) # you're left with only what to map
mapper = DataFrameMapper(cols_map)
...
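To avoid typing out 66 column names, both lists can be generated from the naming scheme used in the question. The following is only a minimal sketch: the toy array standing in for XW, the variable n_keep, and the name XW_small are assumptions made for illustration, not part of the original code.

```python
import numpy as np
import pandas as pd

# Toy stand-in for XW: 5 rows, 76 columns (assumed shape, for illustration only)
XW = np.arange(5 * 76, dtype=float).reshape(5, 76)
col = ["A_" + str(k) for k in range(XW.shape[1])]
XW_db = pd.DataFrame(XW, columns=col)

n_keep = 10  # hypothetical: number of leading columns to keep

# Build both name lists programmatically instead of writing them by hand
cols_keep = col[:n_keep]  # 'A_0' ... 'A_9'
cols_drop = col[n_keep:]  # 'A_10' ... 'A_75'

# drop returns a new dataframe; XW_db itself is left untouched here
XW_small = XW_db.drop(cols_drop, axis=1)

print(XW_small.shape)
print(list(XW_small.columns))
```

The names in cols_keep can then be paired with whatever transformer you want when building the list passed to DataFrameMapper.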

If the reason for not wanting to drop columns is that they will be used later, you can simply assign the result of your drop to other variables, thus creating several subset dataframes which are easier to manipulate:

df2 = df1.drop(cols_drop2, axis=1) # df2 is a subset of df1
df3 = df1.drop(cols_drop3, axis=1) # df3 is a subset of df1
# Alternative is to decide what to keep instead of what to drop
df4 = df1[cols_keep] # df4 is a subset of df1
# df1 remains the full dataframe
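As a quick check that dropping and keeping are interchangeable, here is a small sketch on a made-up six-column dataframe (the data and the variable names by_drop, by_keep, by_position and by_label are hypothetical). It also shows positional and label-based slicing, which may be handy when the columns to keep form a contiguous range.

```python
import pandas as pd

# Made-up dataframe with columns A_0 ... A_5, two rows each
df1 = pd.DataFrame({"A_" + str(k): [k, k + 1] for k in range(6)})

cols_keep = ["A_0", "A_1", "A_2"]
cols_drop = [c for c in df1.columns if c not in cols_keep]

by_drop = df1.drop(cols_drop, axis=1)  # say what to remove
by_keep = df1[cols_keep]               # say what to keep
by_position = df1.iloc[:, :3]          # first three columns by position
by_label = df1.loc[:, "A_0":"A_2"]     # label slices include both endpoints

# All four selections hold the same columns and values
print(by_drop.equals(by_keep))
print(by_keep.equals(by_position))
print(by_position.equals(by_label))
print(df1.shape)  # the full dataframe is unchanged
```

If you plan to modify one of the subsets afterwards, calling .copy() on it first is a reasonable precaution so that the changes cannot affect df1.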

Upvotes: 1
