Sue Doh Nimh

Reputation: 165

Strange behaviour with multiple scikit learn pipelines

I'm training a model using sklearn, and one stage of my training requires running two different feature extraction pipelines.

Each pipeline fits its data without issue, and when each one is fitted and then used straight away, it also transforms the data without issue.

However, when the first pipeline is called after the second pipeline has been fitted, the first pipeline has been altered, and this results in a dimension mismatch error.

The code below recreates the issue (I've simplified it heavily; in reality my two pipelines use different parameters, but this is a minimal reproducible example).

from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

vectorizer = CountVectorizer()

data1 = ['foo bar', 'a foo bar duck', 'goose goose']
data2 = ['foo', 'duck duck swan', 'goose king queen goose']

pipeline1 = Pipeline([('vec', vectorizer),('svd', TruncatedSVD(n_components = 3))]).fit(data1)

print(pipeline1.transform(data1))

# Works fine

pipeline2 = Pipeline([('vec', vectorizer),('svd', TruncatedSVD(n_components = 3))]).fit(data2)

print(pipeline2.transform(data2))

# Works fine

print(pipeline1.transform(data1))

# ValueError: dimension mismatch

Clearly, fitting "pipeline2" is in some way interfering with "pipeline1", but I have no clue why. I'd like to be able to use them concurrently.

Upvotes: 5

Views: 339

Answers (1)

CoMartel

Reputation: 3591

What happens:

Because you define vectorizer once and pass the same instance to both pipelines, here is what happens:

  1. You create vectorizer
  2. You fit the first pipeline:

    • vectorizer is fitted, output dim is (3,4), i.e. 3 elements, 4 words: foo, bar, duck, goose
    • svd is fitted to expect 4 columns as input
  3. You fit the second pipeline:

    • vectorizer is fitted again, this time with 6 words (i.e. columns) as output: foo, duck, swan, goose, king, queen
    • the second svd is fitted on those 6 columns, not relevant here
  4. You call the first pipeline again:

    • the vectorizer outputs a (3,6) matrix, using the words from the last fit, i.e. the fit done by the second pipeline
    • the first svd was fitted to accept 4 columns as input, so it raises an exception.
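The sequence above comes down to both pipelines holding a reference to one and the same vectorizer object. A minimal sketch to check this, assuming the same data1 and data2 as in the question (the variable names same_object and vocab_after_* are only illustrative). It relies on Pipeline not copying its steps by default, and on the fitted vocabulary_ attribute of CountVectorizer:

```python
from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

data1 = ['foo bar', 'a foo bar duck', 'goose goose']
data2 = ['foo', 'duck duck swan', 'goose king queen goose']

# One instance, passed to both pipelines (as in the question)
vectorizer = CountVectorizer()

pipeline1 = Pipeline([('vec', vectorizer), ('svd', TruncatedSVD(n_components=3))]).fit(data1)
# Vocabulary learned from data1 ('a' is dropped by the default token pattern)
vocab_after_first_fit = sorted(vectorizer.vocabulary_)

pipeline2 = Pipeline([('vec', vectorizer), ('svd', TruncatedSVD(n_components=3))]).fit(data2)
# Same object, now refitted on data2
vocab_after_second_fit = sorted(vectorizer.vocabulary_)

# Both pipelines point at the very same vectorizer
same_object = pipeline1.named_steps['vec'] is pipeline2.named_steps['vec']

print(same_object)              # True
print(vocab_after_first_fit)    # ['bar', 'duck', 'foo', 'goose']
print(vocab_after_second_fit)   # ['duck', 'foo', 'goose', 'king', 'queen', 'swan']
```

Since the identity check is True, refitting through pipeline2 necessarily changes what pipeline1's first step produces.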

How to verify this:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

data1 = ['foo bar', 'a foo bar duck', 'goose goose']
data2 = ['foo', 'duck duck swan', 'goose king queen goose']

pipeline1 = Pipeline([('vec', vectorizer)]).fit(data1)
print(pipeline1.transform(data1).shape)

(3, 4)

# Works fine
pipeline2 = Pipeline([('vec', vectorizer)]).fit(data2)
print(pipeline2.transform(data2).shape)

(3, 6)

# Works fine

# pipeline1 now uses the vocabulary from the second fit: 6 columns instead of 4
print(pipeline1.transform(data1).shape)

(3, 6)

How to fix it:

Create a separate vectorizer inside each pipeline instead of sharing one instance, like so:

from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

data1 = ['foo bar', 'a foo bar duck', 'goose goose']
data2 = ['foo', 'duck duck swan', 'goose king queen goose']

pipeline1 = Pipeline([('vec', CountVectorizer()),('svd', TruncatedSVD(n_components = 3))]).fit(data1)

print(pipeline1.transform(data1))

# Works fine

pipeline2 = Pipeline([('vec', CountVectorizer()),('svd', TruncatedSVD(n_components = 3))]).fit(data2)

print(pipeline2.transform(data2))

# Works fine

print(pipeline1.transform(data1))

# Works fine now: each pipeline has its own vectorizer
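If you would rather keep a single vectorizer configuration and reuse it, one possible variation (not what the question's code does) is to give each pipeline an unfitted copy via sklearn.base.clone. This is only a sketch: base_vectorizer and build_pipeline are hypothetical names introduced for illustration.

```python
from sklearn.base import clone
from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

data1 = ['foo bar', 'a foo bar duck', 'goose goose']
data2 = ['foo', 'duck duck swan', 'goose king queen goose']

# Hypothetical shared configuration; it is never fitted itself
base_vectorizer = CountVectorizer()

def build_pipeline(template):
    # clone() returns a new unfitted estimator with the same parameters
    return Pipeline([('vec', clone(template)),
                     ('svd', TruncatedSVD(n_components=3))])

pipeline1 = build_pipeline(base_vectorizer).fit(data1)
pipeline2 = build_pipeline(base_vectorizer).fit(data2)

# pipeline1 still works after pipeline2 has been fitted
shape1 = pipeline1.transform(data1).shape
shape2 = pipeline2.transform(data2).shape
print(shape1, shape2)
```

Each pipeline then owns an independent vectorizer, so fitting one does not change the vocabulary of the other.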

Upvotes: 3
