Reputation: 167
I'm using scikit-learn's Latent Dirichlet Allocation for topic modeling. The LDA object (lda_vectorizer below) is fitted to a corpus of text. Now I want to transform a single text with it to get the topic weights for that text.
def append_lda_features(df, lda_vectorizer, tfidf_vector):
    import pandas as pd
    from time import time

    st = time()
    lda_vector = lda_vectorizer.transform(tfidf_vector)
    print(time() - st)  # time spent in the transform call

    lda_vector = pd.DataFrame(lda_vector)
    lda_vector.columns = ['lda_word_' + str(i)
                          for i in range(lda_vectorizer.n_components)]
    return pd.concat([df, lda_vector], axis=1)
This prints values around 0.67 seconds, which seems really high considering that my LDA has only 15 components and the vectorizer has a 100,000-token vocabulary:
LatentDirichletAllocation(n_components=15, n_jobs=30, verbose=1)
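For reference, the setup looks roughly like this (a minimal sketch: the toy corpus and the TfidfVectorizer settings are stand-ins; only the LDA parameters match my real code):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Stand-in corpus; the real one is large enough to yield ~100000 tokens.
documents = ["first example document", "second example document",
             "third example document about topics"]

tfidf = TfidfVectorizer(max_features=100000)
tfidf_matrix = tfidf.fit_transform(documents)

lda_vectorizer = LatentDirichletAllocation(n_components=15, n_jobs=30,
                                           verbose=1)
lda_vectorizer.fit(tfidf_matrix)

# Later, a single text is vectorized and handed to append_lda_features:
tfidf_vector = tfidf.transform(["one new text to score"])  # shape (1, vocab)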
What should I do to make the LDA work faster?
Upvotes: 0
Views: 764
Reputation: 167
When you transform a single text vector with the LDA model, set n_jobs = 1 first. That way the transform doesn't try to parallelize the work, which avoids the noticeable overhead of spinning up the parallel workers.
def append_lda_features(df, lda_vectorizer, tfidf_vector):
    import pandas as pd
    from time import time

    st = time()
    lda_vectorizer.n_jobs = 1  # single document: skip the parallel backend
    lda_vector = lda_vectorizer.transform(tfidf_vector)
    print(time() - st)

    lda_vector = pd.DataFrame(lda_vector)
    lda_vector.columns = ['lda_word_' + str(i)
                          for i in range(lda_vectorizer.n_components)]
    return pd.concat([df, lda_vector], axis=1)
This one gives me about 0.01 seconds instead of 0.6.
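If you don't want to permanently switch the fitted model to single-threaded mode (say, you still run big batch transforms elsewhere), you can keep the change local. A small sketch, assuming the same lda_vectorizer and tfidf_vector as above:

def transform_single(lda_vectorizer, tfidf_vector):
    old_n_jobs = lda_vectorizer.n_jobs
    lda_vectorizer.n_jobs = 1  # one row: the parallel overhead dominates
    try:
        return lda_vectorizer.transform(tfidf_vector)
    finally:
        lda_vectorizer.n_jobs = old_n_jobs  # restore for batch use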
Upvotes: 0