MYjx

Reputation: 4417

How to apply preprocessing methods to several columns at once in sklearn

I have a large number of columns in my pandas DataFrame, and I am applying sklearn preprocessing to them with DataFrameMapper from the sklearn-pandas library, like this:

import sklearn.preprocessing
from sklearn_pandas import DataFrameMapper

mapper = DataFrameMapper([
    ('gender',sklearn.preprocessing.LabelBinarizer()),
    ('gradelevel',sklearn.preprocessing.LabelEncoder()),
    ('subject',sklearn.preprocessing.LabelEncoder()),
    ('districtid',sklearn.preprocessing.LabelEncoder()),
    ('sbmRate',sklearn.preprocessing.StandardScaler()),
    ('pRate',sklearn.preprocessing.StandardScaler()),
    ('assn1',sklearn.preprocessing.StandardScaler()),
    ('assn2',sklearn.preprocessing.StandardScaler()),
    ('assn3',sklearn.preprocessing.StandardScaler()),
    ('assn4',sklearn.preprocessing.StandardScaler()),
    ('assn5',sklearn.preprocessing.StandardScaler()),
    ('attd1',sklearn.preprocessing.StandardScaler()),
    ('attd2',sklearn.preprocessing.StandardScaler()),
    ('attd3',sklearn.preprocessing.StandardScaler()),
    ('attd4',sklearn.preprocessing.StandardScaler()),
    ('attd5',sklearn.preprocessing.StandardScaler()),
    ('sbm1',sklearn.preprocessing.StandardScaler()),
    ('sbm2',sklearn.preprocessing.StandardScaler()),
    ('sbm3',sklearn.preprocessing.StandardScaler()),
    ('sbm4',sklearn.preprocessing.StandardScaler()),
    ('sbm5',sklearn.preprocessing.StandardScaler())
])

Is there a more succinct way to preprocess many variables at once without writing each one out explicitly?

Another thing I find a little annoying is that when I convert the pandas DataFrame into the arrays that sklearn works with, the column names are lost, which makes feature selection very difficult. Does anyone know how to preserve the column names as keys when converting a pandas DataFrame to NumPy arrays?

Thank you so much!

Upvotes: 2

Views: 3648

Answers (1)

U2EF1

Reputation: 13279

from sklearn.preprocessing import LabelBinarizer, LabelEncoder, StandardScaler
from sklearn_pandas import DataFrameMapper

encoders = ['gradelevel', 'subject', 'districtid']
scalars = ['sbmRate', 'pRate', 'assn1', 'assn2', 'assn3', 'assn4', 'assn5', 'attd1', 'attd2', 'attd3', 'attd4', 'attd5', 'sbm1', 'sbm2', 'sbm3', 'sbm4', 'sbm5']
mapper = DataFrameMapper(
    [('gender', LabelBinarizer())] +
    [(encoder, LabelEncoder()) for encoder in encoders] +
    # A one-element list selector passes a 2-D array, which recent StandardScaler versions expect.
    [([scalar], StandardScaler()) for scalar in scalars]
)

If you're doing this a lot, you could even write your own helper function and call it like this:

mapper = data_frame_mapper(binarizers=['gender'],
    encoders=['gradelevel', 'subject', 'districtid'],
    scalars=['sbmRate', 'pRate', 'assn1', 'assn2', 'assn3', 'assn4', 'assn5', 'attd1', 'attd2', 'attd3', 'attd4', 'attd5', 'sbm1', 'sbm2', 'sbm3', 'sbm4', 'sbm5'])
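The `data_frame_mapper` function called above is not defined in this answer, so here is a minimal sketch of what it might look like. The names `mapper_features` and `mapper_kwargs` are my own assumptions, not part of sklearn-pandas. The scaler columns are wrapped in one-element lists because, as far as I know, recent scikit-learn releases expect 2-D input for `StandardScaler`, and sklearn-pandas passes a 2-D array only when the column selector is a list.

```python
from sklearn.preprocessing import LabelBinarizer, LabelEncoder, StandardScaler


def mapper_features(binarizers=(), encoders=(), scalars=()):
    """Build the (column selector, transformer) list that DataFrameMapper takes.

    Hypothetical helper: every column gets its own fresh transformer instance.
    """
    return (
        [(col, LabelBinarizer()) for col in binarizers]
        + [(col, LabelEncoder()) for col in encoders]
        # [col] rather than col, so the scaler receives a 2-D array.
        + [([col], StandardScaler()) for col in scalars]
    )


def data_frame_mapper(binarizers=(), encoders=(), scalars=(), **mapper_kwargs):
    """Return a DataFrameMapper for the given column groups.

    Extra keyword arguments are forwarded to DataFrameMapper unchanged.
    """
    # Imported lazily so the feature list can still be built and inspected
    # where sklearn-pandas is not installed.
    from sklearn_pandas import DataFrameMapper

    return DataFrameMapper(
        mapper_features(binarizers, encoders, scalars), **mapper_kwargs
    )
```

With this in place the call shown above returns a ready-to-fit mapper. If your sklearn-pandas version supports the `df_out` option, passing `df_out=True` through to the mapper should make `fit_transform` return a DataFrame with named columns instead of a bare array, which may help with the column-name part of the question.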

Upvotes: 9
