Reputation: 185
I have a DataFrame with 14 columns. I am using custom transformers to select the columns I need and process them by data type before combining them in a FeatureUnion.
My custom ColumnSelector transformer is:
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        try:
            return X[self.columns]
        except KeyError:
            cols_error = list(set(self.columns) - set(X.columns))
            raise KeyError("The DataFrame does not include the columns: %s" % cols_error)
This is followed by a custom TypeSelector:
class TypeSelector(BaseEstimator, TransformerMixin):
    def __init__(self, dtype):
        self.dtype = dtype

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        return X.select_dtypes(include=[self.dtype])
The original DataFrame, from which I select the desired columns, is df_with_types and has 981 rows. The columns I wish to extract are listed below along with their respective data types:

meeting_subject_stem_sentence: 'object'
priority_label_stem_sentence: 'object'
attendees: 'category'
day_of_week: 'category'
meeting_time_mins: 'int64'
I then construct my pipeline as follows:
preprocess_pipeline = make_pipeline(
    ColumnSelector(columns=['meeting_subject_stem_sentence', 'attendees', 'day_of_week',
                            'meeting_time_mins', 'priority_label_stem_sentence']),
    FeatureUnion(transformer_list=[
        ("integer_features", make_pipeline(
            TypeSelector('int64'),
            StandardScaler()
        )),
        ("categorical_features", make_pipeline(
            TypeSelector("category"),
            OneHotEnc()
        )),
        ("text_features", make_pipeline(
            TypeSelector("object"),
            TfidfVectorizer(stop_words=stopWords)
        ))
    ])
)
The error thrown when I fit the pipeline to data is:
preprocess_pipeline.fit_transform(df_with_types)
ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,2].shape[0] == 2, expected 981.
I have a hunch this is happening because of the TfidfVectorizer. To check, I fit a pipeline containing only the TfidfVectorizer, without the FeatureUnion:
the_pipe = Pipeline([
    ('col_sel', ColumnSelector(columns=['meeting_subject_stem_sentence', 'attendees', 'day_of_week',
                                        'meeting_time_mins', 'priority_label_stem_sentence'])),
    ('type_selector', TypeSelector('object')),
    ('tfidf', TfidfVectorizer())
])
When I fit the_pipe:
a = the_pipe.fit_transform(df_with_types)
This gives me a 2x2 matrix instead of one with 981 rows:
(0, 0) 1.0
(1, 1) 1.0
Calling get_feature_names() on the tfidf step via named_steps, I get
the_pipe.named_steps['tfidf'].get_feature_names()
[u'meeting_subject_stem_sentence', u'priority_label_stem_sentence']
It seems to be fitting only on the column names and not iterating through the documents. How do I get the vectorizer to iterate over the documents in a Pipeline like the one above? Also, if I wanted to apply a pairwise distance/similarity function to each feature as part of the pipeline, after ColumnSelector and TypeSelector, what must I do?
An example would be...
preprocess_pipeline = make_pipeline(
    ColumnSelector(columns=['meeting_subject_stem_sentence', 'attendees', 'day_of_week',
                            'meeting_time_mins', 'priority_label_stem_sentence']),
    FeatureUnion(transformer_list=[
        ("integer_features", make_pipeline(
            TypeSelector('int64'),
            StandardScaler()
            # pairwise Manhattan distance between each element of the integer feature
        )),
        ("categorical_features", make_pipeline(
            TypeSelector("category"),
            OneHotEnc()
            # pairwise Dice coefficient here
        )),
        ("text_features", make_pipeline(
            TypeSelector("object"),
            TfidfVectorizer(stop_words=stopWords)
            # pairwise cosine similarity here
        ))
    ])
)
Please help. Being a beginner, I have been racking my brain over this to no avail. I have gone through zac_stewart's blog and many other similar ones, but none seem to talk about how to use TfidfVectorizer with TypeSelector or ColumnSelector. Thank you so much for all the help. I hope I formulated the question clearly.
EDIT 1:
If I use a TextSelector transformer, like the following...
class TextSelector(BaseEstimator, TransformerMixin):
    """Transformer that selects a text column from the DataFrame by key."""
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # the key passed here is the column name
        return X[self.key]
text_processing_pipe_line_1 = Pipeline([
    ('selector', TextSelector(key='meeting_subject')),
    ('text_1', TfidfVectorizer(stop_words=stopWords))
])
t = text_processing_pipe_line_1.fit_transform(df_with_types)
(0, 656) 0.378616399898
(0, 75) 0.378616399898
(0, 117) 0.519159384271
(0, 545) 0.512337545421
(0, 223) 0.425773433566
(1, 154) 0.5
(1, 137) 0.5
(1, 23) 0.5
(1, 355) 0.5
(2, 656) 0.497937369182
This works and it is iterating through the documents, so if I could make TypeSelector return a Series, that would work, right? Thanks again for the help.
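For reference, something like this (untested, just to illustrate what I mean by making the selector return a Series) is what I have in mind:

class SingleTypeSelector(BaseEstimator, TransformerMixin):
    """Like TypeSelector, but returns a Series when exactly one column matches the dtype."""
    def __init__(self, dtype):
        self.dtype = dtype

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        selected = X.select_dtypes(include=[self.dtype])
        # If only one column matches, return it as a 1-D Series of documents
        if selected.shape[1] == 1:
            return selected.iloc[:, 0]
        return selected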
Upvotes: 0
Views: 1322
Reputation: 4150
Question 1
You have 2 columns that hold text. Either apply a TfidfVectorizer separately on each of them and then combine the results with a FeatureUnion, or concatenate the strings into one column and treat that concatenation as a single document.

I suspect this is the root of your problem: TfidfVectorizer.fit() takes raw_documents, which must be an iterable that yields str. In your case you pass a DataFrame with 2 text columns, and iterating over a DataFrame yields its column names, which is why the vectorizer fits on only those 2 strings (exactly the feature names you saw). Read the official docs for more info.
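A minimal sketch of the first option, assuming a small per-column selector that returns a single column as a Series so each TfidfVectorizer receives an iterable of strings (the class and step names here are illustrative, not from your code):

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer

class SingleTextSelector(BaseEstimator, TransformerMixin):
    """Return one text column as a 1-D Series, i.e. an iterable of raw string documents."""
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]

# One TfidfVectorizer per text column, stacked side by side with FeatureUnion
text_features = FeatureUnion(transformer_list=[
    ('subject_tfidf', Pipeline([
        ('select', SingleTextSelector('meeting_subject_stem_sentence')),
        ('tfidf', TfidfVectorizer()),
    ])),
    ('priority_tfidf', Pipeline([
        ('select', SingleTextSelector('priority_label_stem_sentence')),
        ('tfidf', TfidfVectorizer()),
    ])),
])

This text_features union could replace the "text_features" branch of your preprocess_pipeline. For the second option, you could build the concatenated column before the pipeline, e.g. df['all_text'] = df['meeting_subject_stem_sentence'] + ' ' + df['priority_label_stem_sentence'], and run a single TfidfVectorizer on it.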
Question 2
You cannot use a pairwise similarity/distance as part of the pipeline because it is not a transformer. Transformers transform each sample independently of the others, whereas a pairwise metric needs 2 samples at the same time. However, you can simply compute it after you fit_transform the pipeline, via sklearn.metrics.pairwise.pairwise_distances.
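A minimal sketch of that post-pipeline step, assuming preprocess_pipeline and df_with_types from your question (the metric choices are only illustrations):

from sklearn.metrics.pairwise import pairwise_distances, cosine_similarity

# Transform first, then compute sample-by-sample metrics outside the pipeline
X = preprocess_pipeline.fit_transform(df_with_types)    # shape: (n_samples, n_features)

manhattan = pairwise_distances(X, metric='manhattan')   # (n_samples, n_samples) distance matrix
cosine_sim = cosine_similarity(X)                       # works directly on sparse output

If you want a different metric per feature block (Manhattan for the integers, Dice for the one-hot categoricals, cosine for the tf-idf part), fit_transform each sub-pipeline separately and compute the metric on that block alone.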
Upvotes: 1