void

Reputation: 2543

Sklearn: FeatureUnion of heterogeneous features gives incompatible row dimensions error with classifier in the pipeline

I want to do binary classification based on several features I have (both text and numerical). The training data is in the form of a pandas DataFrame. My pipeline looks something like this:

final_pipeline = Pipeline([('union', FeatureUnion(
                transformer_list=[('body_trans', Pipeline([('selector', ItemSelector(key='body')),
                                                          ('count_vect', CountVectorizer())])),
                                  ('body_trans2', Pipeline([('selector', ItemSelector(key='body2')),
                                                          ('count_vect', TfidfVectorizer())])),
                                 ('length_trans', Pipeline([('selector', ItemSelector(key='length')),
                                                           ('min_max_scaler',  MinMaxScaler())]))],
                transformer_weights={'body_trans': 1.0,'body_trans2': 1.0,'length_trans': 1.0})),
                          ('svc', SVC())])

ItemSelector looks like this:

class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_frame):
        return data_frame[[self.key]]

Now, when I call final_pipeline.fit(X_train, y_train), it raises ValueError: blocks[0,:] has incompatible row dimensions.

X_train, X_test, y_train, y_test = train_test_split(train_set, target_set)

is how I get my training data. train_set is a dataframe with the fields body, body2, length etc. target_set is a dataframe with only a field called label which is my actual label to classify.

Edit:

I think my input data to the pipeline is not in the proper format.

train_set is my training data with the features, sample :

   body           length  body2
0  blah-blah      193     blah-blah-2
1  blah-blah-blah 153     blah-blah-blah-2 

and the target_set, which is the DataFrame with the classifying label

  label
0  True
1  False

If there is any tutorial on input format for a Pipeline's fitting parameters using DataFrames, please provide me with a link! I can't find proper documentation as to how to load DataFrames as input for Pipelines while using multiple columns as separate features.

Any help is appreciated!

Upvotes: 1

Views: 1912

Answers (1)

Vivek Kumar

Reputation: 36619

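The issue is in your ItemSelector. It outputs a 2-d DataFrame, but CountVectorizer and TfidfVectorizer need a 1-d iterable of strings. Iterating over a DataFrame yields its column labels rather than its rows, so each vectorizer sees a single document and returns a matrix with one row, while the MinMaxScaler branch returns one row per sample. FeatureUnion then cannot stack blocks with different row counts, hence the error.

Code to show the output of ItemSelector: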

from pandas import DataFrame

# Build the sample frame from a dict so that 'length' stays numeric
# (np.array on mixed rows would coerce every value to a string).
df = DataFrame({'body': ['blah-blah', 'blah-blah-2'],
                'length': [193, 153],
                'body2': ['blah-blah-2', 'blah-blah-blah-2']})

body_selector = ItemSelector(key='body')
df_body = body_selector.fit_transform(df)

df_body.shape
# (2,1)
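To see how that shape leads to the row mismatch, here is a minimal sketch. It assumes a toy frame modelled on the sample in the question and calls CountVectorizer directly, outside any pipeline; the variable names are only for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy data modelled on the sample rows from the question (assumed values).
df = pd.DataFrame({
    "body": ["blah-blah", "blah-blah-blah"],
    "length": [193, 153],
    "body2": ["blah-blah-2", "blah-blah-blah-2"],
})

# 2-d selection: iterating a DataFrame yields its column labels,
# so the vectorizer sees a single "document" (the string "body").
rows_from_frame = CountVectorizer().fit_transform(df[["body"]]).shape[0]

# 1-d selection: iterating a Series yields the cell values,
# so the vectorizer sees one document per row.
rows_from_series = CountVectorizer().fit_transform(df["body"]).shape[0]

print(rows_from_frame, rows_from_series)  # prints: 1 2
```

The one-row result from the 2-d selection is what collides with the two-row output of the numeric branch when the blocks are stacked.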

You can define another transformer that flattens (ravels) the selected column into a 1-d array, so the next step receives its input in the correct form.

Add this class to your code like this:

class Converter(BaseEstimator, TransformerMixin):
    def fit(self, x, y=None):
        return self

    def transform(self, data_frame):
        return data_frame.values.ravel()
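As a quick sanity check, the sketch below chains the selector, the converter and a CountVectorizer on a two-row toy frame and inspects the shapes at each stage. The toy data and the name body_pipeline are assumptions for illustration; the two classes are copied from this thread.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline


class ItemSelector(BaseEstimator, TransformerMixin):
    """Select one column and keep it as a 2-d, single-column DataFrame."""

    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_frame):
        return data_frame[[self.key]]


class Converter(BaseEstimator, TransformerMixin):
    """Flatten a single-column DataFrame into a 1-d array."""

    def fit(self, x, y=None):
        return self

    def transform(self, data_frame):
        return data_frame.values.ravel()


# Toy data (assumed values).
df = pd.DataFrame({"body": ["blah-blah", "blah-blah-blah"],
                   "length": [193, 153]})

selected = ItemSelector(key="body").fit_transform(df)   # 2-d, shape (2, 1)
flattened = Converter().fit_transform(selected)         # 1-d, shape (2,)

body_pipeline = Pipeline([("selector", ItemSelector(key="body")),
                          ("converter", Converter()),
                          ("count_vect", CountVectorizer())])
counts = body_pipeline.fit_transform(df)                 # one row per sample

print(selected.shape, flattened.shape, counts.shape[0])
```

With the converter in place, the text branch produces as many rows as the input frame, which is what FeatureUnion needs in order to stack it next to the scaled numeric column.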

Then define your pipeline like this:

final_pipeline = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            ('body_trans', Pipeline([
                ('selector', ItemSelector(key='body')),
                ('converter', Converter()),
                ('count_vect', CountVectorizer())])),
            ('body_trans2', Pipeline([
                ('selector', ItemSelector(key='body2')),
                ('converter', Converter()),
                ('count_vect', TfidfVectorizer())])),
            ('length_trans', Pipeline([
                ('selector', ItemSelector(key='length')),
                ('min_max_scaler', MinMaxScaler())]))],
        transformer_weights={'body_trans': 1.0, 'body_trans2': 1.0, 'length_trans': 1.0})),
    ('svc', SVC())])

There is no need to add the converter to the third branch, because MinMaxScaler requires 2-d input data.

Feel free to ask if you run into any problems.
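If your scikit-learn version is 0.20 or newer, ColumnTransformer may be a simpler alternative to the FeatureUnion plus selector classes: a plain string column name hands the transformer a 1-d column (what the vectorizers expect), while a list of names hands it a 2-d frame (what MinMaxScaler expects). The following is only a sketch under that assumption, with made-up toy data and a 1-d label Series passed as y.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Toy training data (assumed values, same column names as in the question).
train_set = pd.DataFrame({
    "body": ["blah-blah", "blah-blah-blah", "foo bar", "foo bar baz"],
    "length": [193, 153, 20, 35],
    "body2": ["blah-blah-2", "blah-blah-blah-2", "foo-2", "foo-bar-2"],
})
target_set = pd.DataFrame({"label": [True, False, True, False]})

features = ColumnTransformer([
    # String selector -> 1-d column of documents for the vectorizers.
    ("body_trans", CountVectorizer(), "body"),
    ("body_trans2", TfidfVectorizer(), "body2"),
    # List selector -> 2-d input for the scaler.
    ("length_trans", MinMaxScaler(), ["length"]),
])

pipeline = Pipeline([("features", features), ("svc", SVC())])

# Pass the label column as a 1-d Series rather than the whole DataFrame.
pipeline.fit(train_set, target_set["label"])
predictions = pipeline.predict(train_set)
n_feature_rows = features.fit_transform(train_set).shape[0]
```

Here every branch yields one row per sample, so the stacked feature matrix lines up with the labels without any custom selector or converter classes.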

Upvotes: 7
