jwacalex

Reputation: 517

Unable to use FeatureUnion in scikit-learn due to different dimensions

I'm trying to use FeatureUnion to extract different features from a data structure, but it fails because the dimensions differ: ValueError: blocks[0,:] has incompatible row dimensions


Implementation

My FeatureUnion is built the following way:

    features = FeatureUnion([
        ('f1', Pipeline([
            ('get', GetItemTransformer('f1')),
            ('transform', vectorizer_f1)
        ])),
        ('f2', Pipeline([
            ('get', GetItemTransformer('f2')),
            ('transform', vectorizer_f1)
        ]))
    ])

GetItemTransformer is used to pull different parts of the data out of the same structure. The idea is described in the scikit-learn issue tracker.

The structure itself is stored as {'f1': data_f1, 'f2': data_f2}, where data_f1 and data_f2 are different lists with different lengths.
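For context, a minimal sketch of what such a GetItemTransformer could look like. The class body and the sample data below are assumptions for illustration, not the exact code from the issue tracker:

```python
from sklearn.base import BaseEstimator, TransformerMixin


class GetItemTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical selector: returns X[key] and learns nothing."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        # Stateless, so fitting is a no-op.
        return self

    def transform(self, X):
        # X is assumed to be a dict-like container of per-field data.
        return X[self.key]


# Made-up sample data with the same number of samples in both fields.
data = {'f1': ['a b', 'c d', 'e f'], 'f2': ['x', 'y', 'z']}
selected = GetItemTransformer('f1').transform(data)
print(selected)
```

Each branch of the FeatureUnion receives the whole structure and picks out its own field, so every field has to yield one row per sample for the results to be stacked side by side.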


Question

I assume the error occurs because the y vector has a different length than the data fields, but how can I scale the vector to fit in both cases?

Upvotes: 14

Views: 6644

Answers (2)

Josh

Reputation: 655

Here's what worked for me:

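    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.pipeline import FeatureUnion, Pipeline

    class ArrayCaster(BaseEstimator, TransformerMixin):
        def fit(self, x, y=None):
            return self

        def transform(self, data):
            # Turn a 1-D input of shape (n,) into an (n, 1) column so that
            # FeatureUnion can stack it next to the other feature blocks.
            print(data.shape)
            print(np.transpose(np.matrix(data)).shape)
            return np.transpose(np.matrix(data))

    # ItemSelector is not defined here; it is assumed to return data[key].
    FeatureUnion([
        ('text', Pipeline([
            ('selector', ItemSelector(key='text')),
            ('vect', CountVectorizer(ngram_range=(1, 1), binary=True, min_df=3)),
            ('tfidf', TfidfTransformer())
        ])),
        ('other data', Pipeline([
            ('selector', ItemSelector(key='has_foriegn_char')),
            ('caster', ArrayCaster())
        ]))
    ])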

Upvotes: 7

Jim K.

Reputation: 934

I don't know if this applies to your question, but we ran into the same error in a slightly different situation and just solved it.

Our f1 entries were each lists of 15 numeric values and we needed to do tf-idf on f2. This generated the same error about incompatible row dimensions.

After running it through the debugger, we found that the shapes of our matrices were subtly different going into the hstack() call in FeatureUnion: (2659,) and (2659, 706).

Once we cast f1 to a 2D numpy array, its shape changed to (2659, 15) and the hstack call worked.

The cast was something like this: f1 = np.array(list(f1)).
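A small sketch of that cast on made-up data (4 samples of 3 values instead of 2659 samples of 15). The arrays are assumptions for illustration, and a dense np.hstack stands in here for the stacking that FeatureUnion does internally:

```python
import numpy as np

# Hypothetical f1: a 1-D object array whose entries are lists of numbers.
f1 = np.empty(4, dtype=object)
for i in range(4):
    f1[i] = [float(i), float(i + 1), float(i + 2)]
print(f1.shape)  # one dimension only: a row count, no column count

# The cast from the answer: rebuild the data as a proper 2-D array.
f1_2d = np.array(list(f1))
print(f1_2d.shape)  # now rows x columns

# Stand-in for the second feature block (for example tf-idf output).
f2 = np.ones((4, 5))

# With matching row counts and two dimensions each, stacking succeeds.
combined = np.hstack([f1_2d, f2])
print(combined.shape)
```

If the first block is left 1-D, the stacking step has no column dimension to line up against the second block, which is where the row-dimension error comes from.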

Upvotes: 3
