Fridolin Linder

Reputation: 411

Using spaCy as a tokenizer in an sklearn pipeline

I'm trying to use spaCy as a tokenizer in a larger scikit-learn pipeline, but I consistently run into the problem that the task can't be pickled to be sent to the workers.

Minimal example:

from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.datasets import fetch_20newsgroups
from functools import partial
import spacy


def spacy_tokenize(text, nlp):
    return [x.orth_ for x in nlp(text)]

nlp = spacy.load('en', disable=['ner', 'parser', 'tagger'])
tok = partial(spacy_tokenize, nlp=nlp)

pipeline = Pipeline([('vectorize', CountVectorizer(tokenizer=tok)),
                     ('clf', SGDClassifier())])

params = {'vectorize__ngram_range': [(1, 2), (1, 3)]}

CV = RandomizedSearchCV(pipeline,
                        param_distributions=params,
                        n_iter=2, cv=2, n_jobs=2,
                        scoring='accuracy')

categories = ['alt.atheism', 'comp.graphics']
news = fetch_20newsgroups(subset='train',
                          categories=categories,
                          shuffle=True,
                          random_state=42)

CV.fit(news.data, news.target)

Running this code, I get the error:

PicklingError: Could not pickle the task to send it to the workers.

What confuses me is that pickling the tokenizer directly:

import pickle
pickle.dump(tok, open('test.pkl', 'wb'))

works without a problem.
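Also, the same search runs fine with n_jobs=1, so the failure seems tied to joblib shipping the task to worker processes rather than to pickling as such (the CV_single name below is just mine for the single-process variant):

# identical search, but single-process: no task is pickled for
# workers, and the fit completes without error
CV_single = RandomizedSearchCV(pipeline,
                               param_distributions=params,
                               n_iter=2, cv=2, n_jobs=1,
                               scoring='accuracy')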

Does anybody know if it is possible to use spaCy with sklearn cross-validation? Thanks!

Upvotes: 3

Views: 4225

Answers (1)

Vivek Kumar

Reputation: 36619

This is not a solution but a workaround. There seem to be some pickling issues between spaCy and joblib.

If you define the tokenizer as a module-level function in a separate file and then import it into your current file, you can avoid this error. Something like:

  • custom_file.py

    import spacy
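    # the model is loaded once at import time; each worker process
    # re-imports this module and so builds its own copy of nlp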
    nlp = spacy.load('en', disable=['ner', 'parser', 'tagger'])
    
    def spacy_tokenizer(doc):
        return [x.orth_ for x in nlp(doc)]
    
  • main.py

    # other code
    ...
    ...
    
    from custom_file import spacy_tokenizer
    
    pipeline = Pipeline([('vectorize', CountVectorizer(tokenizer=spacy_tokenizer)),
                         ('clf', SGDClassifier())])
    
    ...
    ...
    

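As far as I can tell, this works because a module-level function is pickled by reference (module path plus name), so each worker simply re-imports custom_file and rebuilds nlp on its side, instead of trying to serialize the model that was bound inside the partial.

Along the same lines, if you want the tokenizer to stay configurable, you can use a small callable class that drops the model when pickled and reloads it when unpickled. This is only a sketch under the same spaCy version as the question (the SpacyTokenizer name and the reload trick are mine, not part of spaCy's API):

import spacy

class SpacyTokenizer:
    def __init__(self, disable=('ner', 'parser', 'tagger')):
        self.disable = list(disable)
        self.nlp = spacy.load('en', disable=self.disable)

    def __call__(self, doc):
        # return the raw token texts
        return [x.orth_ for x in self.nlp(doc)]

    def __getstate__(self):
        # leave the loaded model out of the pickle payload
        return {'disable': self.disable}

    def __setstate__(self, state):
        # reload the model in whichever process unpickles us
        self.disable = state['disable']
        self.nlp = spacy.load('en', disable=self.disable)

Put the class in an importable module (not your main script) and pass CountVectorizer(tokenizer=SpacyTokenizer()); pickling the instance then only carries the disable list, and each worker loads its own model.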
Upvotes: 6
