Qasem Nick

Reputation: 167

Why is the sklearn LDA transform VERY SLOW?

I'm using scikit-learn's Latent Dirichlet Allocation for topic modeling. The lda_object is fitted to a corpus of texts. Now I transform a single text with it to get the topic weights for that text.

import pandas as pd

def append_lda_features(df, lda_vectorizer, tfidf_vector):
    from time import time
    st = time()

    # Transform the single tf-idf vector into topic weights
    lda_vector = lda_vectorizer.transform(tfidf_vector)

    print(time() - st)

    # One column per topic, appended to the input frame
    lda_vector = pd.DataFrame(lda_vector)
    lda_vector.columns = ['lda_word_' + str(i)
                          for i in range(lda_vectorizer.n_components)]
    return pd.concat([df, lda_vector], axis=1)

This prints values around 0.67 seconds per call, which seems very high given that my LDA has only 15 components and the vectorizer vocabulary has 100,000 tokens:

LatentDirichletAllocation(n_components=15, n_jobs=30, verbose=1)
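For reference, a minimal sketch that reproduces this setup on synthetic data (the corpus shape, density, and variable names here are my assumptions, not the real pipeline):

from time import time

import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import LatentDirichletAllocation

# Synthetic stand-in for the real tf-idf matrix: 1000 docs x 100000 tokens
rng = np.random.RandomState(0)
X_train = sparse_random(1000, 100000, density=0.001,
                        format='csr', random_state=rng)

lda = LatentDirichletAllocation(n_components=15, n_jobs=30, verbose=1)
lda.fit(X_train)

# Time the transform of a single document, as in append_lda_features
st = time()
lda.transform(X_train[:1])
print(time() - st)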

What should I do to make the LDA work faster?

Upvotes: 0

Views: 764

Answers (1)

Qasem Nick

Reputation: 167

When you're transforming a single text vector with the fitted LDA, you'd better set n_jobs = 1.

That way it takes far less time, because the transform doesn't have to dispatch the work to parallel workers first; for a single document, that dispatch is pure overhead.

import pandas as pd

def append_lda_features(df, lda_vectorizer, tfidf_vector):
    from time import time
    st = time()

    # A single document doesn't benefit from parallelism,
    # so skip the worker startup cost
    lda_vectorizer.n_jobs = 1
    lda_vector = lda_vectorizer.transform(tfidf_vector)

    print(time() - st)

    lda_vector = pd.DataFrame(lda_vector)
    lda_vector.columns = ['lda_word_' + str(i)
                          for i in range(lda_vectorizer.n_components)]
    return pd.concat([df, lda_vector], axis=1)

This one takes about 0.01 seconds instead of 0.6.
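If the same fitted model is also used for batch transforms elsewhere, a safer variant restores the original setting afterwards. A minimal sketch (the helper name transform_single is my own, not part of the question's code):

def transform_single(lda_model, tfidf_vector):
    # Temporarily disable parallelism for a one-row transform,
    # then restore whatever n_jobs the model had before
    original_n_jobs = lda_model.n_jobs
    lda_model.n_jobs = 1
    try:
        return lda_model.transform(tfidf_vector)
    finally:
        lda_model.n_jobs = original_n_jobs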

Upvotes: 0
