Reputation: 771
TypeError: 'TfidfModel' object is not callable
Why can't I compute the TF-IDF vector for each document after initializing the model?
I started with 999 documents: 999 paragraphs with about 5-15 sentences each. After tokenizing everything with spaCy, I created the dictionary (~16k unique tokens) and the corpus (a list of lists of tuples).
Now I'm ready to create the tfidf matrix (and later LDA and w2v matrices) for some ML. However, after initializing the tfidf model with my corpus (to calculate the 'IDF'):
tfidf = models.TfidfModel(corpus)
I get the following error message when trying to see the tfidf of a single doc with tfidf(corpus[5]):
TypeError: 'TfidfModel' object is not callable
I am able to create this model using a different corpus where I have four docs, each consisting of only one sentence. There I can confirm that the expected corpus format is a list of lists of tuples: [doc1[(word1, count),(word2, count),...], doc2[(word3, count),(word4,count),...]...]
from gensim import corpora, models, similarities
texts = [['teenager', 'martha', 'moxley'...], ['ok','like','kris','usual',...]...]
dictionary = corpora.Dictionary(texts)
>>> Dictionary(15937 unique tokens: ['teenager', 'martha', 'moxley']...)
corpus = [dictionary.doc2bow(text) for text in texts]
>>> [[(0, 2),(1, 2),(2, 1)...],[(3, 1),(4, 1)...]...]
tfidf = models.TfidfModel(corpus)
>>> TfidfModel(num_docs=999, num_nnz=86642)
tfidf(corpus[0])
>>> TypeError: 'TfidfModel' object is not callable
corpus[0]
>>> [(0, 2),(1, 2),(2, 1)...]
print(type(corpus),type(corpus[1]),type(corpus[1][3]))
>>> <class 'list'> <class 'list'> <class 'tuple'>
Upvotes: 1
Views: 1276
Reputation: 7467
Expanding on @whs2k's answer, the square bracket syntax is used to form a transformation wrapper around the corpus, forming a kind of lazy processing pipeline.
I didn't get it until I read the note in this tutorial: https://radimrehurek.com/gensim/tut2.html
Calling model[corpus] only creates a wrapper around the old corpus document stream – actual conversions are done on-the-fly, during document iteration. We cannot convert the entire corpus at the time of calling corpus_transformed = model[corpus], because that would mean storing the result in main memory, and that contradicts gensim’s objective of memory-independence. If you will be iterating over the transformed corpus_transformed multiple times, and the transformation is costly, serialize the resulting corpus to disk first and continue using that.
But I still don't feel I fully understand the underlying Python list magic.
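For what it's worth, the square brackets are not really list-specific: in Python, obj[x] is shorthand for obj.__getitem__(x), and any class can define that method. With gensim this means writing tfidf[corpus[0]] rather than tfidf(corpus[0]). Below is a minimal sketch of the idea in plain Python. ToyTfidf, LazyCorpus and the weights are hypothetical names and made-up numbers for illustration only; this is not gensim's actual implementation.

```python
class LazyCorpus:
    """Hypothetical wrapper: transforms documents only while being iterated."""

    def __init__(self, model, corpus):
        self.model = model
        self.corpus = corpus

    def __iter__(self):
        # Nothing is computed until someone loops over the wrapper.
        for doc in self.corpus:
            yield self.model[doc]


class ToyTfidf:
    """Hypothetical stand-in for a transformation model (not gensim code)."""

    def __init__(self, weights):
        # weights: token id -> made-up weight, standing in for a real IDF
        self.weights = weights

    def __getitem__(self, bow):
        # toy[x] calls this method; toy(x) would need __call__ instead.
        if bow and isinstance(bow[0], tuple):
            # A single document: a list of (token_id, count) tuples.
            return [(tid, cnt * self.weights.get(tid, 1.0)) for tid, cnt in bow]
        # Anything else is treated as a whole corpus and wrapped lazily.
        return LazyCorpus(self, bow)


toy = ToyTfidf({0: 0.5, 1: 2.0})
corpus = [[(0, 2), (1, 1)], [(1, 3)]]

single = toy[corpus[0]]   # one document, transformed immediately
wrapped = toy[corpus]     # a LazyCorpus; no document transformed yet
all_docs = list(wrapped)  # iteration triggers the per-document work
```

Since ToyTfidf defines __getitem__ but not __call__, writing toy(corpus[0]) raises the same kind of "object is not callable" TypeError as in the question.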
Upvotes: 0