Reputation: 210
A little background about this project: I have a set of documents, each with an identifier and some text, e.g. {name: "sports-football", text: "Content related to football sports"}.
I need to find the best match for a given text input within this corpus. I have partly achieved this using Gensim's `similarities.Similarity` with LDA and LSI models.
How can I update the Gensim `Similarity` index with a new document? The idea is to keep training the model while it is live.
Here are the steps I followed.
QueryText = "Guardiola moved Lionel Messi to the No 9 role so that he didn't have to come deep and I think Aguero drops back into deeper positions too often."
Note: some of the code below is simplified pseudocode.
The index is created using
`similarities.Similarity(indexpath, model,topics)`
Create A dictionary
dictionary = Dictionary(QueryText )
Create a corpus
corpus = Corpus(QueryText, dictionary)
Create an LDA Model
LDAModel = ldaModel(corpus,dictionary)
Update existing dictionary, model, and index
Update existing dictionary
existing_dictionary.add_document(dictionary)
Update existing LDA Model
existing_lda_model.update(corpus)
Update existing Similarity index
existing_index.add_documents(LDAModel[corpus])
Apart from the warning below, the update seemed to work.
gensim\models\ldamodel.py:535: RuntimeWarning: overflow encountered in exp2 perwordbound, np.exp2(-perwordbound), len(chunk), corpus_words
Let's run the similarity for the query text
vec_bow = dictionary.doc2bow(QueryText)
vec_model = existing_lda_model[vec_bow]
sims = existing_index[vec_model]
However, it failed with the error below.
Similarity index with 723 documents in 1 shards (stored under \Files\models\lda_model)
Similarity index with 725 documents in 0 shards (stored under \Files\models\lda_model)
\lib\site-packages\gensim\models\ldamodel.py:535: RuntimeWarning: overflow encountered in exp2
perwordbound, np.exp2(-perwordbound), len(chunk), corpus_words
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-3-8fe711724367> in <module>()
45 trigram = Trigram.apply_trigram_model(queryText, bigram, trigram)
46 vec_bow = dictionry.doc2bow(trigram)
---> 47 vec_model = lda_model[vec_bow]
48 print(vec_model)
49
~\Anaconda3\envs\lf\lib\site-packages\gensim\models\ldamodel.py in __getitem__(self, bow, eps)
1103 `(topic_id, topic_probability)` 2-tuples.
1104 """
-> 1105 return self.get_document_topics(bow, eps, self.minimum_phi_value, self.per_word_topics)
1106
1107 def save(self, fname, ignore=('state', 'dispatcher'), separately=None, *args, **kwargs):
~\Anaconda3\envs\lf\lib\site-packages\gensim\models\ldamodel.py in get_document_topics(self, bow, minimum_probability, minimum_phi_value, per_word_topics)
944 return self._apply(corpus, **kwargs)
945
--> 946 gamma, phis = self.inference([bow], collect_sstats=per_word_topics)
947 topic_dist = gamma[0] / sum(gamma[0]) # normalize distribution
948
~\Anaconda3\envs\lf\lib\site-packages\gensim\models\ldamodel.py in inference(self, chunk, collect_sstats)
442 Elogthetad = Elogtheta[d, :]
443 expElogthetad = expElogtheta[d, :]
--> 444 expElogbetad = self.expElogbeta[:, ids]
445
446 # The optimal phi_{dwk} is proportional to expElogthetad_k * expElogbetad_w.
IndexError: index 718 is out of bounds for axis 1 with size 713
I would really appreciate any help with this.
Upvotes: 4
Views: 850
Reputation: 16728
The latter error (AssertionError: mismatch between supplied and computed number of non-zeros in the sparse matrix) most likely stems from the issue flagged by the warning: `perwordbound` overflows, and the matrix calculated from its undefined value makes the update fail.
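To illustrate what the warning itself means, here is a minimal NumPy sketch. The input values are made up for illustration and do not come from the model in the question; the snippet only shows that `np.exp2` returns `inf` once its argument exceeds what a 64-bit float can represent.

```python
import numpy as np

# Hypothetical per-word bounds, chosen only to show the two regimes.
small_bound = -10.0
huge_bound = -2000.0

# The warning is silenced here so the snippet runs quietly; without this
# context manager NumPy emits "RuntimeWarning: overflow encountered in exp2".
with np.errstate(over="ignore"):
    ok_value = np.exp2(-small_bound)        # 2**10, representable
    overflowed_value = np.exp2(-huge_bound)  # 2**2000, too large for float64

print(ok_value, overflowed_value)
```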
I suggest updating the model with larger batches (not a single query). The word counts already in the model may be disproportionately large compared with the small number of words you are trying to update it with. With floating-point arithmetic this can cause subtle errors.
Again, please try updating the model with batches whose size is proportionate to the model's source data (e.g. 1/10th or 1/20th of its size).
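One way to do this is to buffer incoming documents and only update once enough have accumulated. The following is a minimal pure-Python sketch under stated assumptions: `UpdateBuffer`, `corpus_size` and `fraction` are hypothetical names introduced here, 723 is the document count reported by the index in the question, and the placeholder vector stands in for real bag-of-words documents.

```python
class UpdateBuffer:
    """Collect bag-of-words documents until a batch is large enough to train on."""

    def __init__(self, corpus_size, fraction=0.1):
        # The batch threshold is a fraction of the original training corpus
        # size, but never fewer than one document.
        self.threshold = max(1, int(corpus_size * fraction))
        self.pending = []

    def add(self, bow):
        """Queue one document; return a full batch when ready, otherwise None."""
        self.pending.append(bow)
        if len(self.pending) < self.threshold:
            return None
        batch, self.pending = self.pending, []
        return batch


buffer = UpdateBuffer(corpus_size=723, fraction=0.1)
batches = []
for _ in range(150):
    batch = buffer.add([(0, 1)])  # placeholder bag-of-words vector
    if batch is not None:
        batches.append(batch)
        # In the real pipeline, existing_lda_model.update(batch) would go here.
```

Documents that arrive between updates stay in `buffer.pending`, so they can still be served from the old model until the next batch is full.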
Revision, based on this thread:
Melissa Roemmele wrote:
FYI, I also got this error when I tried to create an LSI index for a corpus on a bag-of-words corpus without first transforming it into tf-idf. I could build the LSI model on the bag-of-words but building the index for it gave me the error.
You may want to apply tf-idf first, before passing the QueryText to the model.
Upvotes: 1