Alessandro

Reputation: 4100

LabelEncoding DataFrame string columns

I've got a DataFrame with floats, strings, and strings that can be interpreted as dates.

Based on "Label encoding across multiple columns in scikit-learn", I have this code:

from sklearn.base import BaseEstimator, TransformerMixin
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

class MultiColumnLabelEncoder:
    def __init__(self,columns = None):
        self.columns = columns # array of column names to encode

    def fit(self,X,y=None):
        return self # not relevant here

    def transform(self,X):
        '''
        Transforms columns of X specified in self.columns using
        LabelEncoder(). If no columns specified, transforms all
        columns in X.
        '''
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            for colname,col in output.items():
                output[colname] = LabelEncoder().fit_transform(col)
        return output

    def fit_transform(self,X,y=None):
        return self.fit(X,y).transform(X)

num_attributes = ["a", "b", "c"]
num_attributes = list(df_num_median)
str_attributes = list(df_str_only)

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attributes)), # transforming the Pandas DataFrame into a NumPy array
    ('imputer', SimpleImputer(strategy="median")), # replacing missing values with the median
    ('std_scalar', StandardScaler()), # scaling the features using standardization (subtract the mean, divide by the standard deviation)
])

from sklearn.preprocessing import LabelEncoder

str_pipeline = Pipeline([
    ('selector', DataFrameSelector(str_attributes)), # transforming the Pandas DataFrame into a NumPy array 
    ('encoding', MultiColumnLabelEncoder(str_attributes))
])

from sklearn.pipeline import FeatureUnion
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    #("str_pipeline", str_pipeline) # replaced by line below
    ("str_pipeline", MultiColumnLabelEncoder(str_attributes))
])

df_prepared = full_pipeline.fit_transform(df_combined)

The num_pipeline part of the pipeline works just fine. The str_pipeline part raises this error:

IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices

This doesn't happen if I comment out the MultiColumnLabelEncoder in the str_pipeline. I also created some code to apply the MultiColumnLabelEncoder on the dataset without the pipeline and it works just fine. Any ideas? As an additional step, I would have to create two separate pipelines for strings and date strings.

EDIT: added DataFrameSelector class


Upvotes: 0

Views: 661

Answers (1)

Vivek Kumar

Reputation: 36599

The problem is not in the MultiColumnLabelEncoder, but in the DataFrameSelector above it in the pipeline.

You are doing this:

str_pipeline = Pipeline([
    ('selector', DataFrameSelector(str_attributes)), # transforming the Pandas DataFrame into a NumPy array 
    ('encoding', MultiColumnLabelEncoder(str_attributes))
])

DataFrameSelector returns the .values attribute of the DataFrame, which is a NumPy array. So when you do this in MultiColumnLabelEncoder:

...
...
    if self.columns is not None:
        for col in self.columns:
            output[col] = LabelEncoder().fit_transform(output[col])

the error is thrown by output[col]. Here output is a copy of X, and X is a NumPy array (DataFrameSelector has already converted it), so it carries no column names and cannot be indexed with a string such as col.
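To make the failure concrete, here is a minimal sketch that reproduces it outside any pipeline. The toy DataFrame and its column names are assumptions made up for illustration; they are not from the question's data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names are illustrative only.
df = pd.DataFrame({"city": ["Rome", "Milan", "Rome"], "price": [1.0, 2.0, 3.0]})

# Label-based lookup works on the DataFrame itself.
city_values = df["city"].tolist()

# This mirrors what DataFrameSelector.transform returns: a plain ndarray.
arr = df[["city"]].values

error_name = None
try:
    arr["city"]  # an ndarray has no column labels to look up
except IndexError as exc:
    error_name = type(exc).__name__
    print(error_name, exc)
```

The same string lookup that succeeds on df fails on arr, which is the situation MultiColumnLabelEncoder is put in once DataFrameSelector runs before it.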

Since you are already passing str_attributes to MultiColumnLabelEncoder, you don't need DataFrameSelector in the pipeline. Just do this:

full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("str_pipeline", MultiColumnLabelEncoder(str_attributes))
])

I have removed str_pipeline because, after dropping DataFrameSelector, it would contain only a single transformer.
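As a quick check of what the encoder step sees when it receives the DataFrame directly, the sketch below runs the same per-column LabelEncoder call on a small frame. The data and column names are hypothetical, and the loop is a stripped-down stand-in for MultiColumnLabelEncoder.transform, not the class itself.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "city": ["Rome", "Milan", "Rome"],
    "kind": ["flat", "house", "flat"],
    "price": [1.0, 2.0, 3.0],
})
str_attributes = ["city", "kind"]

encoded = df.copy()
for col in str_attributes:
    # Same per-column call the question's transform() makes;
    # it works because encoded is still a DataFrame with column labels.
    encoded[col] = LabelEncoder().fit_transform(encoded[col])

print(encoded)
```

Note that columns not listed in str_attributes (price here) pass through unchanged, so it is worth checking the combined output for columns you did not intend to include twice.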

Upvotes: 1
