Gensim.Similarity Add document or Live training

Question

A little background about this project. I have copies with an identifier and the text, e.g. {name: "sports-football", text: "Content related to football sports"}.

I need to find the right match for the given text input within this corpus. However, I was able to achieve somewhat using Gensim. Similarity with LDA and LSI Model.

How to update the Genism.Similarity Index with new a document. The idea here is to keep training the model at live stage.

Here is the step I followed.

QueryText = "Guardiola moved Lionel Messi to the No 9 role so that he didn't have to come deep and I think Aguero drops back into deeper positions too often."

Note: some codes are just layman

The index is created using

`similarities.Similarity(indexpath, model,topics)`

Create A dictionary

dictionary = Dictionary(QueryText )
Create a corpus

corpus = Corpus(QueryText, dictionary)
Create an LDA Model

LDAModel = ldaModel(corpus,dictionary)

Update existing dictionary, model, and index

Update existing dictionary

existing_dictionary.add_document(dictionary)

Update existing LDA Model

existing_lda_model.update(corpus)

Update existing Similarity index

existing_index.add_dcoument(LDAModel[corpus])

Other than below warning update seems to be worked.

gensim\models\ldamodel.py:535: RuntimeWarning: overflow encountered in exp2 perwordbound, np.exp2(-perwordbound), len(chunk), corpus_words

Let's run the similarity for the query text

vec_bow = dictionary.doc2bow(QueryText) 
vec_model = existing_lda_model[vec_bow] 
sims = existing_index[vec_model]

However, it failed with below error.

Similarity index with 723 documents in 1 shards (stored under \Files\models\lda_model)
Similarity index with 725 documents in 0 shards (stored under \Files\models\lda_model)
\lib\site-packages\gensim\models\ldamodel.py:535: RuntimeWarning: overflow encountered in exp2
  perwordbound, np.exp2(-perwordbound), len(chunk), corpus_words
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
 in ()
     45 trigram = Trigram.apply_trigram_model(queryText, bigram, trigram)
     46 vec_bow = dictionry.doc2bow(trigram)
---> 47 vec_model =  lda_model[vec_bow]
     48 print(vec_model)
     49 

~\Anaconda3\envs\lf\lib\site-packages\gensim\models\ldamodel.py in __getitem__(self, bow, eps)
   1103             `(topic_id, topic_probability)` 2-tuples.
   1104         """
-> 1105         return self.get_document_topics(bow, eps, self.minimum_phi_value, self.per_word_topics)
   1106 
   1107     def save(self, fname, ignore=('state', 'dispatcher'), separately=None, *args, **kwargs):

~\Anaconda3\envs\lf\lib\site-packages\gensim\models\ldamodel.py in get_document_topics(self, bow, minimum_probability, minimum_phi_value, per_word_topics)
    944             return self._apply(corpus, **kwargs)
    945 
--> 946         gamma, phis = self.inference([bow], collect_sstats=per_word_topics)
    947         topic_dist = gamma[0] / sum(gamma[0])  # normalize distribution
    948 

~\Anaconda3\envs\lf\lib\site-packages\gensim\models\ldamodel.py in inference(self, chunk, collect_sstats)
    442             Elogthetad = Elogtheta[d, :]
    443             expElogthetad = expElogtheta[d, :]
--> 444             expElogbetad = self.expElogbeta[:, ids]
    445 
    446             # The optimal phi_{dwk} is proportional to expElogthetad_k * expElogbetad_w.

IndexError: index 718 is out of bounds for axis 1 with size 713

I really appreciate, helping me with this. Looking forward to awesome replies.

Gensim.Similarity Add document or Live training

Answers (1)

Related Questions