Speeding up vectorization in sklearn

Question

First question, sorry if I mess something up.

I'm doing a classification project involving 1,600 unique text documents across 90 labels. Many of these documents are research papers, so the feature set is very large: well over a million features.

My problem is that vectorizing is taking forever. I understand it won't be fast given my data, but the time it takes is becoming impractical. I followed the advice in the first answer to this question, but it doesn't seem to have helped. I suspect the optimizations the answerer suggests are already incorporated into scikit-learn.

Here's my code, using the adjusted stemmed vectorizer functions:

%%timeit

vect = StemmedCountVectorizer(min_df=3, max_df=0.7, max_features=200000, tokenizer=tokenize,
        strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}',
        ngram_range=(1, 3), stop_words='english')

vect.fit(list(xtrain) + list(xvalid))
xtrain_cv = vect.transform(xtrain)
xvalid_cv = vect.transform(xvalid)
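
One possible source of repeated work in the snippet above: every document is analyzed once in fit and then again in transform. A minimal sketch of doing a single pass instead, assuming it is acceptable to vectorize the combined list and slice the rows afterwards. A plain CountVectorizer and toy documents stand in for StemmedCountVectorizer and the real data, neither of which is shown here.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for xtrain / xvalid (hypothetical data, not from the post).
xtrain = ["the cat sat on the mat", "dogs chase cats", "the dog sat"]
xvalid = ["cats and dogs sat", "the mat"]

# Plain CountVectorizer used here in place of StemmedCountVectorizer.
vect = CountVectorizer(ngram_range=(1, 2))

# Analyze every document exactly once, then split the rows back apart.
combined = list(xtrain) + list(xvalid)
X_all = vect.fit_transform(combined)
xtrain_cv = X_all[:len(xtrain)]
xvalid_cv = X_all[len(xtrain):]
```

The vocabulary is still learned from train and validation together, as in the original code, so the resulting matrices should match what fit followed by two transform calls would produce.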

The tokenizer references this function:

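import string

import nltk
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')

def stem_tokens(tokens, stemmer):
    # Stem every token; repeated words are re-stemmed each time they occur.
    return [stemmer.stem(item) for item in tokens]

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    # Drop tokens that appear as a substring of string.punctuation.
    tokens = [i for i in tokens if i not in string.punctuation]
    # Keep only tokens made up entirely of letters and punctuation characters.
    tokens = [i for i in tokens if all(j.isalpha() or j in string.punctuation for j in i)]
    # Drop tokens containing a slash.
    tokens = [i for i in tokens if '/' not in i]
    stems = stem_tokens(tokens, stemmer)
    return stems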
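
Since research papers repeat the same words many times, stemming each occurrence separately may be a large share of the cost. A minimal sketch of memoizing the stemmer with functools.lru_cache follows; make_cached_stem, stem_tokens_cached and toy_stem are hypothetical names introduced for illustration, and toy_stem is only a stand-in so the example runs without NLTK.

```python
from functools import lru_cache

def make_cached_stem(stem_fn, maxsize=None):
    """Wrap a single-word stemming function so each distinct word is stemmed once."""
    @lru_cache(maxsize=maxsize)
    def cached(word):
        return stem_fn(word)
    return cached

def stem_tokens_cached(tokens, cached_stem):
    # Same output as stem_tokens, but repeated words hit the cache.
    return [cached_stem(t) for t in tokens]

# Stand-in stemmer for demonstration only (hypothetical); with the code above
# this would be SnowballStemmer('english').stem.
calls = []
def toy_stem(word):
    calls.append(word)
    return word.lower().rstrip("s")

cached_stem = make_cached_stem(toy_stem)
stems = stem_tokens_cached(["Models", "model", "models", "Models"], cached_stem)
```

Here the second "Models" is served from the cache, so toy_stem runs three times for four tokens. Whether this helps in practice depends on how much of the 24 minutes is spent stemming rather than in word_tokenize or n-gram counting, which would need to be measured.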

The %%timeit report:

24min 16s ± 28.2 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Is there anything that's obviously slowing me down? I'm thinking about reducing my n-gram range to (1, 2), as I don't think I'm getting many useful 3-gram features, but besides that I'm not sure what else to do.
