Reputation: 162
I am trying to one-hot encode the categorical variables of my Pandas dataframe, which includes both categorical and continuous variables. I realise this can be done easily with the pandas .get_dummies() function, but I need to use a pipeline so I can generate a PMML file later on.
This is the code to create a mapper. The categorical variables I would like to encode are stored in a list called 'dummies'.
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
mapper = DataFrameMapper(
[(d, LabelEncoder()) for d in dummies] +
[(d, OneHotEncoder()) for d in dummies]
)
And this is the code to create a pipeline, including the mapper and linear regression.
from sklearn2pmml import PMMLPipeline
from sklearn.linear_model import LinearRegression
lm = PMMLPipeline([("mapper", mapper),
("regressor", LinearRegression())])
When I now try to fit (with 'features' being a dataframe, and 'targets' a series), it gives an error 'could not convert string to float'.
lm.fit(features, targets)
Upvotes: 14
Views: 26337
Reputation: 698
LabelEncoder and LabelBinarizer are intended for encoding/binarizing the target (label) of your data, i.e. the y vector. They do more or less the same thing as OneHotEncoder, the main difference being that the Label preprocessing steps don't accept matrices, only 1-D vectors.
import numpy as np
import pandas as pd
example = pd.DataFrame({'x': np.arange(2, 14, 2),
                        'cat1': ['A', 'B', 'A', 'B', 'C', 'A'],
                        'cat2': ['p', 'q', 'w', 'p', 'q', 'w']})
dummies = ['cat1', 'cat2']
    x cat1 cat2
0   2    A    p
1   4    B    q
2   6    A    w
3   8    B    p
4  10    C    q
5  12    A    w
As an example, LabelEncoder().fit_transform(example['cat1']) works, but LabelEncoder().fit_transform(example[dummies]) throws a ValueError exception.
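A minimal sketch of that difference, reusing the two categorical columns from the example frame. The names codes and raised are only illustrative:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

example = pd.DataFrame({'cat1': ['A', 'B', 'A', 'B', 'C', 'A'],
                        'cat2': ['p', 'q', 'w', 'p', 'q', 'w']})

# A single column (1-D) is encoded as integer class indices.
codes = LabelEncoder().fit_transform(example['cat1'])

# Two columns (2-D) are rejected, because LabelEncoder expects a 1-D target.
try:
    LabelEncoder().fit_transform(example[['cat1', 'cat2']])
    raised = False
except ValueError:
    raised = True
```

Here codes holds one integer per row, and raised ends up True.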
In contrast, OneHotEncoder accepts multiple columns:
from sklearn.preprocessing import OneHotEncoder
OneHotEncoder().fit_transform(example[dummies])
<6x6 sparse matrix of type '<class 'numpy.float64'>'
with 12 stored elements in Compressed Sparse Row format>
This can be incorporated into a pipeline using a ColumnTransformer, passing through (or alternatively applying different transformations to) the other columns:
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([('encode_cats', OneHotEncoder(), dummies),],
remainder='passthrough')
pd.DataFrame(ct.fit_transform(example), columns = ct.get_feature_names_out())
encode_cats__cat1_A encode_cats__cat1_B ... encode_cats__cat2_w remainder__x
0 1.0 0.0 ... 0.0 2.0
1 0.0 1.0 ... 0.0 4.0
2 1.0 0.0 ... 1.0 6.0
3 0.0 1.0 ... 0.0 8.0
4 0.0 0.0 ... 0.0 10.0
5 1.0 0.0 ... 1.0 12.0
Finally, slot this into a pipeline:
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
Pipeline([('preprocessing', ct),
          ('regressor', LinearRegression())])
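For completeness, here is a self-contained sketch of fitting such a pipeline end to end. The target values are made up purely for illustration, and handle_unknown='ignore' is an optional addition (not part of the answer above) so that categories unseen during fit do not raise at predict time. It uses a plain scikit-learn Pipeline; whether swapping in PMMLPipeline from the question exports cleanly has not been checked here.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

features = pd.DataFrame({'x': np.arange(2, 14, 2),
                         'cat1': ['A', 'B', 'A', 'B', 'C', 'A'],
                         'cat2': ['p', 'q', 'w', 'p', 'q', 'w']})
# Hypothetical targets, invented for this sketch only.
targets = pd.Series([1.0, 2.5, 3.0, 4.5, 6.0, 5.5])

# One-hot encode the string columns; pass the numeric column through untouched.
ct = ColumnTransformer(
    [('encode_cats', OneHotEncoder(handle_unknown='ignore'), ['cat1', 'cat2'])],
    remainder='passthrough')

model = Pipeline([('preprocessing', ct),
                  ('regressor', LinearRegression())])
model.fit(features, targets)
predictions = model.predict(features)
```

Because the encoding happens inside the pipeline, fit and predict take the raw dataframe with its string columns.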
Upvotes: 4
Reputation: 7195
In scikit-learn versions before 0.20, OneHotEncoder doesn't support string features, and with [(d, OneHotEncoder()) for d in dummies] you are applying it to all dummies columns. Use LabelBinarizer instead:
from sklearn.preprocessing import LabelBinarizer
mapper = DataFrameMapper(
    [(d, LabelBinarizer()) for d in dummies]
)
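To see what LabelBinarizer produces for one string column, here is a small sketch using plain scikit-learn (no DataFrameMapper); the sample values are illustrative:

```python
from sklearn.preprocessing import LabelBinarizer

cat1 = ['A', 'B', 'A', 'B', 'C', 'A']

lb = LabelBinarizer()
# Each distinct string becomes its own 0/1 indicator column.
onehot = lb.fit_transform(cat1)
classes = list(lb.classes_)
```

With three distinct values the result has three columns, ordered as in classes. One caveat: when a column has exactly two distinct values, LabelBinarizer returns a single 0/1 column rather than two.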
An alternative would be to use the LabelEncoder with a second OneHotEncoder step.
mapper = DataFrameMapper(
[(d, LabelEncoder()) for d in dummies]
)
lm = PMMLPipeline([("mapper", mapper),
("onehot", OneHotEncoder()),
("regressor", LinearRegression())])
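The two-step idea (strings to integer codes, then integer codes to indicator columns) can be sketched without sklearn_pandas or sklearn2pmml. This is only an illustration of what each step produces, not the mapper's actual internals, and the sample columns are illustrative:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

cat1 = ['A', 'B', 'A', 'B', 'C', 'A']
cat2 = ['p', 'q', 'w', 'p', 'q', 'w']

# Step 1: one LabelEncoder per column maps strings to integer codes.
codes = np.column_stack([LabelEncoder().fit_transform(col)
                         for col in (cat1, cat2)])

# Step 2: OneHotEncoder expands every integer column into indicator columns.
onehot = OneHotEncoder().fit_transform(codes).toarray()
```

Each row of onehot contains exactly one 1 per original column.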
Upvotes: 9