whs2k

Reputation: 771

TF-IDF Model Creation TypeError in Gensim

TypeError: 'TfidfModel' object is not callable

Why can't I compute the TF-IDF matrix for each doc after initializing the model?

I started with 999 documents: 999 paragraphs with about 5-15 sentences each. After tokenizing everything with spaCy, I created the dictionary (~16k unique tokens) and the corpus (a list of lists of tuples).

Now I'm ready to create the TF-IDF matrix (and later LDA and word2vec matrices) for some ML. However, after initializing the TF-IDF model with my corpus (to calculate the IDF) using tfidf = models.TfidfModel(corpus), I get the following error when I try to view the TF-IDF of a single doc with tfidf(corpus[5]): TypeError: 'TfidfModel' object is not callable

I am able to create this model using a different corpus where I have four docs, each consisting of only one sentence. There I can confirm that the expected corpus format is a list of lists of tuples: [doc1[(word1, count),(word2, count),...], doc2[(word3, count),(word4,count),...]...]

from gensim import corpora, models, similarities

texts = [['teenager', 'martha', 'moxley'...], ['ok','like','kris','usual',...]...]
dictionary = corpora.Dictionary(texts)
>>> Dictionary(15937 unique tokens: ['teenager', 'martha', 'moxley']...)

corpus = [dictionary.doc2bow(text) for text in texts]
>>> [[(0, 2),(1, 2),(2, 1)...],[(3, 1),(4, 1)...]...]

tfidf = models.TfidfModel(corpus)
>>> TfidfModel(num_docs=999, num_nnz=86642)

tfidf(corpus[0])
>>> TypeError: 'TfidfModel' object is not callable

corpus[0]
>>> [(0, 2),(1, 2),(2, 1)...]

print(type(corpus),type(corpus[1]),type(corpus[1][3]))
>>> <class 'list'> <class 'list'> <class 'tuple'>

Upvotes: 1

Views: 1276

Answers (2)

scipilot

Reputation: 7467

Expanding on @whs2k's answer, the square bracket syntax is used to form a transformation wrapper around the corpus, forming a kind of lazy processing pipeline.

I didn't get it until I read the note in this tutorial: https://radimrehurek.com/gensim/tut2.html

Calling model[corpus] only creates a wrapper around the old corpus document stream – actual conversions are done on-the-fly, during document iteration. We cannot convert the entire corpus at the time of calling corpus_transformed = model[corpus], because that would mean storing the result in main memory, and that contradicts gensim’s objective of memory-independence. If you will be iterating over the transformed corpus_transformed multiple times, and the transformation is costly, serialize the resulting corpus to disk first and continue using that.

But I still don't feel I fully understand the underlying Python list magic.
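
For what it's worth, the brackets are probably less magic than they look. In Python, obj[x] calls obj.__getitem__(x), and any class can define that method, whereas obj(x) only works if the class defines __call__. Below is a minimal sketch of that idea using a hypothetical ToyTfidf class. It is my own stand-in, not gensim's code: the weighting is a plain count * log2(N / df) with no normalization, and the lazy corpus wrapper is a one-shot generator, which is simpler than the wrapper object gensim returns.

```python
import math


class ToyTfidf:
    """Hypothetical stand-in for a gensim-style model (not gensim's real code)."""

    def __init__(self, corpus):
        # For each token id, count how many documents contain it.
        self.num_docs = len(corpus)
        self.doc_freq = {}
        for bow in corpus:
            for token_id, _count in bow:
                self.doc_freq[token_id] = self.doc_freq.get(token_id, 0) + 1

    def _transform(self, bow):
        # Weight each raw count by a simple log2(N / df) inverse document frequency.
        return [
            (token_id, count * math.log2(self.num_docs / self.doc_freq[token_id]))
            for token_id, count in bow
        ]

    def __getitem__(self, item):
        # model[x] lands here. There is no __call__, so model(x) raises TypeError.
        if item and isinstance(item[0], tuple):
            # A single bag-of-words document: convert it right away.
            return self._transform(item)
        # A whole corpus: return a lazy generator that converts during iteration.
        return (self._transform(bow) for bow in item)


corpus = [[(0, 2), (1, 1)], [(1, 1), (2, 3)]]
model = ToyTfidf(corpus)

one_doc = model[corpus[0]]   # converted immediately
lazy = model[corpus]         # nothing is computed until we iterate

try:
    model(corpus[0])         # parentheses need __call__, which is missing
except TypeError as err:
    error_message = str(err)
```

So the error in the question is just Python reporting that the class defines no __call__, while the bracket form is routed to __getitem__, which is free to return either a converted document or a lazy wrapper.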

Upvotes: 0

whs2k

Reputation: 771

Instead of: tfidf(corpus[0])

Try: tfidf[corpus[0]]

Upvotes: 2
