Reputation: 165
I'm training a model using sklearn, and there's a sequence of my training that requires running two different feature extraction pipelines.
Each pipeline fits its data without issue, and if I fit and transform with one pipeline before moving on to the next, the transforms also work.
However, when the first pipeline is called after the second pipeline has been fitted, the first pipeline has been altered, and this results in a dimension mismatch error.
The code below reproduces the issue (I've simplified it heavily; in reality my two pipelines use different parameters, but this is a minimal reproducible example).
from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
vectorizer = CountVectorizer()
data1 = ['foo bar', 'a foo bar duck', 'goose goose']
data2 = ['foo', 'duck duck swan', 'goose king queen goose']
pipeline1 = Pipeline([('vec', vectorizer),('svd', TruncatedSVD(n_components = 3))]).fit(data1)
print(pipeline1.transform(data1))
# Works fine
pipeline2 = Pipeline([('vec', vectorizer),('svd', TruncatedSVD(n_components = 3))]).fit(data2)
print(pipeline2.transform(data2))
# Works fine
print(pipeline1.transform(data1))
# ValueError: dimension mismatch
Clearly the fitting of "pipeline2" is in some way interfering with "pipeline1" but I have no clue why. I'd like to be able to use them concurrently.
Upvotes: 5
Views: 339
Reputation: 3591
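Because you define vectorizer once, outside the pipelines, both pipelines hold a reference to the same CountVectorizer object. Here is what happens:
you fit the first pipeline: vectorizer learns the vocabulary of data1 (4 tokens), and the first SVD is fitted on 4 columns;
you fit the second pipeline: the same vectorizer is refitted on data2, which overwrites its vocabulary (now 6 tokens);
you call the first pipeline again: its 'vec' step now produces 6 columns, but its SVD still expects 4, hence the dimension mismatch.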
How to verify this (using only the vectorizer step, so the number of columns is visible):
vectorizer = CountVectorizer()
data1 = ['foo bar', 'a foo bar duck', 'goose goose']
data2 = ['foo', 'duck duck swan', 'goose king queen goose']
pipeline1 = Pipeline([('vec', vectorizer)]).fit(data1)
print(pipeline1.transform(data1).shape)
# (3, 4): vocabulary learned from data1
pipeline2 = Pipeline([('vec', vectorizer)]).fit(data2)
print(pipeline2.transform(data2).shape)
# (3, 6): the same vectorizer, refitted on data2
# vectorizer = CountVectorizer()
print(pipeline1.transform(data1).shape)
# (3, 6): pipeline1 now uses the vocabulary learned from data2
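The sharing can also be checked directly. The following is a minimal sketch, assuming the default Pipeline settings (no memory caching, so steps are not cloned on fit); the variable names shared, vocab_after_first_fit and vocab_after_second_fit are only illustrative.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer

data1 = ['foo bar', 'a foo bar duck', 'goose goose']
data2 = ['foo', 'duck duck swan', 'goose king queen goose']

# One vectorizer instance, reused in both pipelines (the problematic setup)
vectorizer = CountVectorizer()

pipeline1 = Pipeline([('vec', vectorizer)]).fit(data1)
# Snapshot of the vocabulary right after the first fit
vocab_after_first_fit = sorted(vectorizer.vocabulary_)

pipeline2 = Pipeline([('vec', vectorizer)]).fit(data2)
# The second fit overwrites the vocabulary of the very same object
vocab_after_second_fit = sorted(vectorizer.vocabulary_)

# Both 'vec' steps point to the same Python object
shared = pipeline1.named_steps['vec'] is pipeline2.named_steps['vec']

print(shared)                  # True
print(vocab_after_first_fit)   # ['bar', 'duck', 'foo', 'goose']
print(vocab_after_second_fit)  # ['duck', 'foo', 'goose', 'king', 'queen', 'swan']
```

Note that the single-character token 'a' is dropped by the default token pattern of CountVectorizer, which is why data1 yields 4 columns rather than 5.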
To fix it, create a separate CountVectorizer inside each pipeline, so that no fitted state is shared:
from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
data1 = ['foo bar', 'a foo bar duck', 'goose goose']
data2 = ['foo', 'duck duck swan', 'goose king queen goose']
pipeline1 = Pipeline([('vec', CountVectorizer()),('svd', TruncatedSVD(n_components = 3))]).fit(data1)
print(pipeline1.transform(data1))
# Works fine
pipeline2 = Pipeline([('vec', CountVectorizer()),('svd', TruncatedSVD(n_components = 3))]).fit(data2)
print(pipeline2.transform(data2))
# Works fine
print(pipeline1.transform(data1))
# Works fine: pipeline1 keeps its own fitted vectorizer
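If you would rather keep a single configured vectorizer as a template, one option is to give each pipeline its own copy with sklearn.base.clone, which returns a new unfitted estimator with the same parameters. This is only a sketch: base_vectorizer and build_pipeline are hypothetical names introduced for illustration, not part of the original code.

```python
from sklearn.base import clone
from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

data1 = ['foo bar', 'a foo bar duck', 'goose goose']
data2 = ['foo', 'duck duck swan', 'goose king queen goose']

# Template holding the shared configuration (hypothetical name); never fitted itself
base_vectorizer = CountVectorizer()

def build_pipeline(texts):
    """Fit a pipeline that owns an independent copy of the template vectorizer."""
    return Pipeline([
        ('vec', clone(base_vectorizer)),        # fresh, unfitted copy per pipeline
        ('svd', TruncatedSVD(n_components=3)),
    ]).fit(texts)

pipeline1 = build_pipeline(data1)
pipeline2 = build_pipeline(data2)

# Each pipeline keeps its own vocabulary, so fitting pipeline2 leaves pipeline1 intact
n_features1 = len(pipeline1.named_steps['vec'].vocabulary_)
n_features2 = len(pipeline2.named_steps['vec'].vocabulary_)
independent = pipeline1.named_steps['vec'] is not pipeline2.named_steps['vec']

shape1 = pipeline1.transform(data1).shape
shape2 = pipeline2.transform(data2).shape
print(n_features1, n_features2, independent)
print(shape1, shape2)
```

Calling clone(pipeline1) on a whole pipeline works the same way: it clones every step, so the copy shares no fitted state with the original.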
Upvotes: 3