Reputation: 4100
I've got a DataFrame with floats, strings, and strings that can be interpreted as dates. To encode the string columns I use a DataFrameSelector together with the MultiColumnLabelEncoder from Label encoding across multiple columns in scikit-learn:
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        # select the given columns and return them as a NumPy array
        return X[self.attribute_names].values
class MultiColumnLabelEncoder:
    def __init__(self, columns=None):
        self.columns = columns  # array of column names to encode

    def fit(self, X, y=None):
        return self  # not relevant here

    def transform(self, X):
        '''
        Transforms columns of X specified in self.columns using
        LabelEncoder(). If no columns specified, transforms all
        columns in X.
        '''
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            for colname, col in output.iteritems():
                output[colname] = LabelEncoder().fit_transform(col)
        return output

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)
num_attributes = ["a", "b", "c"]
num_attributes = list(df_num_median)
str_attributes = list(df_str_only)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer, StandardScaler

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attributes)),  # turn the pandas DataFrame into a NumPy array
    ('imputer', Imputer(strategy="median")),          # replace missing values with the median
    ('std_scalar', StandardScaler()),                 # standardize features (subtract the mean, divide by the standard deviation)
])
from sklearn.preprocessing import LabelEncoder

str_pipeline = Pipeline([
    ('selector', DataFrameSelector(str_attributes)),  # turn the pandas DataFrame into a NumPy array
    ('encoding', MultiColumnLabelEncoder(str_attributes))
])
from sklearn.pipeline import FeatureUnion

full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    #("str_pipeline", str_pipeline)  # replaced by the line below
    ("str_pipeline", MultiColumnLabelEncoder(str_attributes))
])
df_prepared = full_pipeline.fit_transform(df_combined)
The num_pipeline part of the pipeline works just fine. In the str_pipeline part I get the error
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
This doesn't happen if I comment out the MultiColumnLabelEncoder in the str_pipeline. I also wrote some code that applies the MultiColumnLabelEncoder to the dataset outside the pipeline, and that works just fine (see the standalone sketch below). Any ideas? As a further step I will also have to create two separate pipelines, one for plain strings and one for date strings.
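For example, a standalone check along these lines (with a made-up DataFrame standing in for my real data) runs without problems:

import pandas as pd

# made-up stand-in for my string-only DataFrame
df_str_test = pd.DataFrame({
    "city": ["Berlin", "Paris", "Berlin"],
    "color": ["red", "blue", "red"],
})
encoded = MultiColumnLabelEncoder(list(df_str_test)).fit_transform(df_str_test)
print(encoded)   # each string column is replaced by integer codes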
EDIT: added DataFrameSelector class
Upvotes: 0
Views: 661
Reputation: 36599
The problem is not in the MultiColumnLabelEncoder, but in the DataFrameSelector above it in the pipeline.
You are doing this:
str_pipeline = Pipeline([
    ('selector', DataFrameSelector(str_attributes)),  # turn the pandas DataFrame into a NumPy array
    ('encoding', MultiColumnLabelEncoder(str_attributes))
])
DataFrameSelector returns the .values attribute of the DataFrame, which is a NumPy array. So obviously, when you do this in MultiColumnLabelEncoder:
...
if self.columns is not None:
    for col in self.columns:
        output[col] = LabelEncoder().fit_transform(output[col])
The error is thrown by output[col]: output is a copy of X, which at this point is a NumPy array (DataFrameSelector has already converted it), and a NumPy array carries no information about column names.
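You can reproduce the same error with a plain NumPy array (a minimal sketch; the array and the column name are made up):

import numpy as np

arr = np.array([["red", "Berlin"], ["blue", "Paris"]])   # what DataFrameSelector hands on
arr["color"]   # IndexError: only integers, slices (`:`), ... are valid indices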
Since you are already passing str_attributes to MultiColumnLabelEncoder, you don't need DataFrameSelector in the pipeline at all. Just do this:
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("str_pipeline", MultiColumnLabelEncoder(str_attributes))
])
I have removed str_pipeline entirely, because after dropping the DataFrameSelector it contained only a single transformer.
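As a rough end-to-end check, here is a minimal sketch with a made-up DataFrame (column names and values are invented; the pipelines are rebuilt for the toy columns and assume the same scikit-learn version as above, where Imputer is still available):

import pandas as pd

df_toy = pd.DataFrame({
    "a": [1.0, 2.0, None],
    "b": [3.0, 4.0, 5.0],
    "city": ["Berlin", "Paris", "Berlin"],
})

toy_num_pipeline = Pipeline([
    ('selector', DataFrameSelector(["a", "b"])),
    ('imputer', Imputer(strategy="median")),
    ('std_scalar', StandardScaler()),
])

toy_full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", toy_num_pipeline),
    ("str_pipeline", MultiColumnLabelEncoder(["city"])),   # receives the DataFrame directly
])

prepared = toy_full_pipeline.fit_transform(df_toy)
print(prepared.shape)   # (3, 5): two scaled numeric columns plus the encoder's
                        # copy of the DataFrame ("city" label-encoded)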
Upvotes: 1