Strange behaviour with multiple scikit-learn pipelines

Question

I'm training a model using sklearn, and one stage of my training requires running two different feature extraction pipelines.

Each pipeline fits its data without issue, and as long as each one transforms its data right after being fitted, the transforms work too.

However, when the first pipeline is called after the second pipeline has been fitted, the first pipeline has somehow been altered, and this results in a dimension mismatch error.

The code below recreates the issue (I've simplified it heavily; in reality my two pipelines use different parameters, but this is a minimal reproducible example).

from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

vectorizer = CountVectorizer()

data1 = ['foo bar', 'a foo bar duck', 'goose goose']
data2 = ['foo', 'duck duck swan', 'goose king queen goose']

pipeline1 = Pipeline([('vec', vectorizer),('svd', TruncatedSVD(n_components = 3))]).fit(data1)

print(pipeline1.transform(data1))

# Works fine

pipeline2 = Pipeline([('vec', vectorizer),('svd', TruncatedSVD(n_components = 3))]).fit(data2)

print(pipeline2.transform(data2))

# Works fine

print(pipeline1.transform(data1))

# ValueError: dimension mismatch

Clearly the fitting of "pipeline2" is in some way interfering with "pipeline1" but I have no clue why. I'd like to be able to use them concurrently.

CoMartel · Accepted Answer

What happens :

Because vectorizer is created once and passed to both pipelines, both pipelines share the same object. Here is what happens:

You create vectorizer.
You fit the first pipeline:
- vectorizer is fitted; its output has shape (3, 4), i.e. 3 documents and 4 words: foo, bar, duck, goose (the one-letter token "a" is dropped by CountVectorizer's default token pattern)
- svd is fitted to expect 4 columns as input
You fit the second pipeline:
- the same vectorizer is fitted again, this time with 6 words (i.e. columns) as output: foo, duck, swan, goose, king, queen
- the other svd is fitted; not relevant here
You call the first pipeline again:
- the vectorizer outputs a (3, 6) matrix, using the words from its last fit, i.e. the one done by the second pipeline
- the first svd was fitted to accept 4 columns as input, so it raises the exception.

How to verify this :

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer

# A single vectorizer instance, shared by both pipelines
vectorizer = CountVectorizer()

data1 = ['foo bar', 'a foo bar duck', 'goose goose']
data2 = ['foo', 'duck duck swan', 'goose king queen goose']

pipeline1 = Pipeline([('vec', vectorizer)]).fit(data1)
print(pipeline1.transform(data1).shape)

# (3, 4)

# Fitting pipeline2 refits the same vectorizer on data2
pipeline2 = Pipeline([('vec', vectorizer)]).fit(data2)
print(pipeline2.transform(data2).shape)

# (3, 6)

# pipeline1 now uses the vocabulary learned from data2
print(pipeline1.transform(data1).shape)

# (3, 6)
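
Another way to confirm the diagnosis is to check object identity and the learned vocabulary directly. This is a minimal sketch, assuming default Pipeline settings (no memory caching), in which case a pipeline keeps a reference to the estimator it was given rather than a copy. The names shared_vec_p1, shared_vec_p2 and the vocab_after_* variables are only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

data1 = ['foo bar', 'a foo bar duck', 'goose goose']
data2 = ['foo', 'duck duck swan', 'goose king queen goose']

# One vectorizer instance, passed to both pipelines
vectorizer = CountVectorizer()

pipeline1 = Pipeline([('vec', vectorizer)]).fit(data1)
# Snapshot of the vocabulary right after the first fit
vocab_after_first_fit = sorted(vectorizer.vocabulary_)

pipeline2 = Pipeline([('vec', vectorizer)]).fit(data2)
# The same object has now been refitted on data2
vocab_after_second_fit = sorted(vectorizer.vocabulary_)

shared_vec_p1 = pipeline1.named_steps['vec']
shared_vec_p2 = pipeline2.named_steps['vec']

print(shared_vec_p1 is shared_vec_p2)  # True: both steps are the same object
print(vocab_after_first_fit)           # ['bar', 'duck', 'foo', 'goose']
print(vocab_after_second_fit)          # ['duck', 'foo', 'goose', 'king', 'queen', 'swan']
```

Since the "vec" step of both pipelines is the very same object, refitting it through pipeline2 also changes what pipeline1 produces.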

How to fix it :

Give each pipeline its own CountVectorizer instance, for example by creating the vectorizer inside the pipeline definition, like so :

from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

data1 = ['foo bar', 'a foo bar duck', 'goose goose']
data2 = ['foo', 'duck duck swan', 'goose king queen goose']

pipeline1 = Pipeline([('vec', CountVectorizer()),('svd', TruncatedSVD(n_components = 3))]).fit(data1)

print(pipeline1.transform(data1))

# Works fine

pipeline2 = Pipeline([('vec', CountVectorizer()),('svd', TruncatedSVD(n_components = 3))]).fit(data2)

print(pipeline2.transform(data2))

# Works fine

print(pipeline1.transform(data1))
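
If you would rather configure the vectorizer once and reuse that configuration, sklearn.base.clone returns a new, unfitted estimator with the same parameters. The following is a minimal sketch of that variant, not part of the original code; base_vectorizer is a name chosen here for illustration.

```python
from sklearn.base import clone
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

data1 = ['foo bar', 'a foo bar duck', 'goose goose']
data2 = ['foo', 'duck duck swan', 'goose king queen goose']

# Configured once; never fitted itself, only used as a template
base_vectorizer = CountVectorizer()

pipeline1 = Pipeline([
    ('vec', clone(base_vectorizer)),  # independent, unfitted copy
    ('svd', TruncatedSVD(n_components=3)),
]).fit(data1)

pipeline2 = Pipeline([
    ('vec', clone(base_vectorizer)),  # another independent copy
    ('svd', TruncatedSVD(n_components=3)),
]).fit(data2)

vec1 = pipeline1.named_steps['vec']
vec2 = pipeline2.named_steps['vec']

print(vec1 is vec2)                      # False: each pipeline has its own vectorizer
print(len(vec1.vocabulary_))             # 4
print(len(vec2.vocabulary_))             # 6
print(pipeline1.transform(data1).shape)  # (3, 3)
```

Each pipeline keeps its own vocabulary, so fitting pipeline2 no longer affects pipeline1.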
