I am using gensim 3.8.3 and seeing some odd results from the bm25 module of that package. I realize this is an old version and these results might not hold in a newer gensim release, but I'm in a restricted environment and this is the version I have access to. I create a BM25 object as follows:
from gensim.summarization import bm25
from gensim.corpora import Dictionary
test_texts = ["hello how are you", "hi how are you", " hello what does the word hello mean", "why doesn't this work", "hello", "hello hello hello hello"]
texts = [doc.split() for doc in test_texts]  # preprocessing such as stopword removal could go here
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text, allow_update=True) for text in texts]
bm25_model = bm25.BM25(corpus, k1=1.75, b=1)
With this model, the same word can have multiple idf values, which seems to go against what idf means. Is this a bug, or some variant of BM25? Or am I misunderstanding what's going on here? For example, looking up the idf entries for 'hello':
word = 'hello'
word_id = dictionary.token2id.get(word)
word_frequency = dictionary.dfs[word_id]
idf_value = {k: v for k, v in bm25_model.idf.items() if k[0] == word_id}
idf_value
{(1, 1): 0.5877866649021191,
(1, 2): 1.2992829841302609,
(1, 4): 1.2992829841302609}
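One possible reading of this output: the idf keys are (word_id, count) pairs, which is the shape doc2bow returns, so the model may be treating each pair as its own "word". The sketch below checks that idea in plain Python without gensim. The helpers to_bow and document_frequencies are hypothetical stand-ins written for this illustration, the ids they assign differ from the ones Dictionary produces, and the idf formula log(N - n + 0.5) - log(n + 0.5) is an assumption about what this gensim version computes.

```python
import math
from collections import Counter

test_texts = ["hello how are you", "hi how are you",
              " hello what does the word hello mean",
              "why doesn't this work", "hello", "hello hello hello hello"]
texts = [doc.split() for doc in test_texts]

# Hypothetical stand-in for Dictionary.doc2bow (not gensim code):
# maps each token to an integer id and returns sorted (id, count) pairs.
vocab = {}

def to_bow(tokens):
    counts = Counter(tokens)
    return sorted((vocab.setdefault(tok, len(vocab)), n)
                  for tok, n in counts.items())

def document_frequencies(corpus):
    """For each distinct element, count how many documents contain it."""
    df = Counter()
    for document in corpus:
        df.update(set(document))
    return df

bow_corpus = [to_bow(tokens) for tokens in texts]
hello_id = vocab["hello"]
n_docs = len(texts)

# Case 1: documents are lists of (id, count) pairs, as in the question.
# Each distinct pair becomes its own key.
df_bow = document_frequencies(bow_corpus)
hello_keys_bow = sorted(k for k in df_bow if k[0] == hello_id)

# Assumed idf formula (classic BM25 form); not confirmed against gensim source.
idf_bow = {k: math.log(n_docs - df_bow[k] + 0.5) - math.log(df_bow[k] + 0.5)
           for k in hello_keys_bow}

# Case 2: documents are lists of token strings. One key per word.
df_tokens = document_frequencies(texts)
hello_keys_tokens = [k for k in df_tokens if k == "hello"]

print(hello_keys_bow)     # three separate keys for 'hello'
print(idf_bow)
print(hello_keys_tokens)  # a single key for 'hello'
```

Under these assumptions the three pair-keyed idf values come out the same as the ones printed above, which suggests the model is counting (id, count) pairs rather than words. If BM25 in this version expects tokenized documents rather than bag-of-words vectors, passing texts instead of corpus would be the thing to try.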