Reputation: 4417
I have many columns in my pandas DataFrame, and I am trying to apply sklearn preprocessing to them using DataFrameMapper from the sklearn-pandas library, like this:
mapper = DataFrameMapper([
    ('gender', sklearn.preprocessing.LabelBinarizer()),
    ('gradelevel', sklearn.preprocessing.LabelEncoder()),
    ('subject', sklearn.preprocessing.LabelEncoder()),
    ('districtid', sklearn.preprocessing.LabelEncoder()),
    ('sbmRate', sklearn.preprocessing.StandardScaler()),
    ('pRate', sklearn.preprocessing.StandardScaler()),
    ('assn1', sklearn.preprocessing.StandardScaler()),
    ('assn2', sklearn.preprocessing.StandardScaler()),
    ('assn3', sklearn.preprocessing.StandardScaler()),
    ('assn4', sklearn.preprocessing.StandardScaler()),
    ('assn5', sklearn.preprocessing.StandardScaler()),
    ('attd1', sklearn.preprocessing.StandardScaler()),
    ('attd2', sklearn.preprocessing.StandardScaler()),
    ('attd3', sklearn.preprocessing.StandardScaler()),
    ('attd4', sklearn.preprocessing.StandardScaler()),
    ('attd5', sklearn.preprocessing.StandardScaler()),
    ('sbm1', sklearn.preprocessing.StandardScaler()),
    ('sbm2', sklearn.preprocessing.StandardScaler()),
    ('sbm3', sklearn.preprocessing.StandardScaler()),
    ('sbm4', sklearn.preprocessing.StandardScaler()),
    ('sbm5', sklearn.preprocessing.StandardScaler())
])
I am wondering whether there is a more succinct way to preprocess many variables at once without writing each one out explicitly.
Another thing I find a bit annoying is that when I transform the pandas DataFrame into arrays that sklearn can work with, the column names are lost, which makes feature selection very difficult. Does anyone know how to preserve the column names as keys when converting pandas DataFrames to NumPy arrays?
Thank you so much!
Upvotes: 2
Views: 3648
Reputation: 13279
from sklearn.preprocessing import LabelBinarizer, LabelEncoder, StandardScaler
from sklearn_pandas import DataFrameMapper
encoders = ['gradelevel', 'subject', 'districtid']
scalars = ['sbmRate', 'pRate', 'assn1', 'assn2', 'assn3', 'assn4', 'assn5', 'attd1', 'attd2', 'attd3', 'attd4', 'attd5', 'sbm1', 'sbm2', 'sbm3', 'sbm4', 'sbm5']
mapper = DataFrameMapper(
    [('gender', LabelBinarizer())] +
    [(encoder, LabelEncoder()) for encoder in encoders] +
    [(scalar, StandardScaler()) for scalar in scalars]
)
If you're doing this a lot, you could even write your own function and call it like this:
mapper = data_frame_mapper(binarizers=['gender'],
                           encoders=['gradelevel', 'subject', 'districtid'],
                           scalars=['sbmRate', 'pRate', 'assn1', 'assn2', 'assn3', 'assn4', 'assn5', 'attd1', 'attd2', 'attd3', 'attd4', 'attd5', 'sbm1', 'sbm2', 'sbm3', 'sbm4', 'sbm5'])
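The `data_frame_mapper` function called above is not part of sklearn-pandas, so here is a rough sketch of how it might be written. The names `build_features` and `mapper_kwargs` are my own, hypothetical choices. I am also assuming that the scaled columns should be selected as `['col']` rather than `'col'`, because the sklearn-pandas docs describe the list form as passing a 2-D array, which current versions of `StandardScaler` expect.

```python
def build_features(groups):
    """Expand (columns, factory, as_2d) groups into (selector, transformer) tuples.

    A fresh transformer is constructed for every column, so fitted state is
    never shared between columns.
    """
    features = []
    for columns, factory, as_2d in groups:
        for column in columns:
            # ['col'] hands the transformer a 2-D array, 'col' a 1-D one.
            selector = [column] if as_2d else column
            features.append((selector, factory()))
    return features


def data_frame_mapper(binarizers=(), encoders=(), scalars=(), **mapper_kwargs):
    """Hypothetical helper matching the call shown above (not a library function)."""
    # Third-party imports are kept local so build_features stays dependency-free.
    from sklearn.preprocessing import LabelBinarizer, LabelEncoder, StandardScaler
    from sklearn_pandas import DataFrameMapper

    features = build_features([
        (binarizers, LabelBinarizer, False),
        (encoders, LabelEncoder, False),
        (scalars, StandardScaler, True),  # assumption: scalers need 2-D input
    ])
    # Extra keyword arguments are forwarded to DataFrameMapper unchanged.
    return DataFrameMapper(features, **mapper_kwargs)
```

Because extra keyword arguments are forwarded, this may also help with the column-name part of the question: the sklearn-pandas docs describe a `df_out=True` option that returns a DataFrame instead of a bare array, so `data_frame_mapper(..., df_out=True)` should keep the names, though it is worth checking against the version you have installed.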
Upvotes: 9