Reputation: 395
I have two different texts that I want to compare using TF-IDF vectorization. What I am doing is:
The vectors I get after step 2 have different shapes. But conceptually, both vectors should have the same shape; only then can they be compared.
What am I doing wrong? Please help.
Thanks in advance.
Upvotes: 4
Views: 5157
Reputation: 21
I'm one of those later people :)
My understanding of TF-IDF is that the IDF is computed from how often a word (or n-gram) occurs across both documents. So comparing only what matches between the two doesn't really capture how common a word is in both documents, which is what weeds out common words. Is there a way to do that with n-grams without the index error?
ValueError: Shape of passed values is (26736, 1), indices imply (60916, 1)
```
# Applying TF-IDF to the texts
# Assumes text, text2 and total are iterables of document strings defined earlier
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer

# Instantiate the TfidfVectorizers
ngram_vectorizer1 = TfidfVectorizer(ngram_range=(2, 2))  # bigrams, 1st vector
ngram_vectorizer2 = TfidfVectorizer(ngram_range=(2, 2))  # bigrams, 2nd
ngram_vectorizert = TfidfVectorizer(ngram_range=(2, 2))  # bigrams, total

# Fit the models
ngram_vector1 = ngram_vectorizer1.fit_transform(text)
ngram_vector2 = ngram_vectorizer2.fit_transform(text2)
ngram_vectort = ngram_vectorizert.fit_transform(total)

# Save feature names (get_feature_names() was removed in scikit-learn 1.2)
ngramfeatures1 = ngram_vectorizer1.get_feature_names_out()
ngramfeatures2 = ngram_vectorizer2.get_feature_names_out()
ngramfeaturest = ngram_vectorizert.get_feature_names_out()
print("\n\nngramfeatures1 : \n", ngramfeatures1)
print("\n\nngramfeatures2 : \n", ngramfeatures2)
print("\n\nngram_vector1 : \n", ngram_vector1.toarray())
print("\n\nngram_vector2 : \n", ngram_vector2.toarray())

# Compute the IDF values
first_tfidf_transformer_ngram = TfidfTransformer(smooth_idf=True, use_idf=True)
second_tfidf_transformer_ngram = TfidfTransformer(smooth_idf=True, use_idf=True)
total_tfidf_transformer_ngram = TfidfTransformer(smooth_idf=True, use_idf=True)
first_tfidf_transformer_ngram.fit(ngram_vector1)
second_tfidf_transformer_ngram.fit(ngram_vector2)
total_tfidf_transformer_ngram.fit(ngram_vectort)

# 1st IDF values, indexed by the 1st vectorizer's feature names
ngram_first_idf = pd.DataFrame(first_tfidf_transformer_ngram.idf_, index=ngramfeatures1, columns=["idf_weights"])
# Sort ascending
ngram_first_idf.sort_values(by=['idf_weights'])  # this one should really be looking towards something from the "Total" calculations if I'm understanding it correctly?
```
Upvotes: 0
Reputation: 395
As G. Anderson already pointed out, and to help future readers: when we call the fit() function of TfidfVectorizer on document D1, the bag of words (the vocabulary) is constructed from D1.
The transform() function then computes the TF-IDF weight of each word in that bag of words.
Now our aim is to compare document D2 with D1, that is, to see how many words of D1 match up with D2. That's why we perform fit_transform() on D1 and then only transform() on D2: transform() applies D1's bag of words to D2 and weights D2's tokens using the vocabulary and IDF learned from D1. Because both vectors then have one column per word in D1's vocabulary, they have the same shape, which gives a relative comparison of D1 against D2.
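The fit_transform() then transform() pattern described above can be sketched as follows. This is a minimal illustration, not the asker's actual code: `d1` and `d2` are made-up example documents, and cosine similarity is just one common choice for comparing the two resulting vectors.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical example documents (not from the original question)
d1 = "the quick brown fox jumps over the lazy dog"
d2 = "the quick brown cat sleeps near the lazy dog"

vectorizer = TfidfVectorizer()

# fit_transform() learns the vocabulary (and IDF weights) from D1 only
v1 = vectorizer.fit_transform([d1])

# transform() reuses D1's vocabulary, so D2 is mapped onto the same columns;
# words in D2 that D1 never contained are ignored
v2 = vectorizer.transform([d2])

# Both vectors have the same number of columns and can be compared
print(v1.shape, v2.shape)

# Cosine similarity is one possible comparison (an assumption, not from the question)
similarity = cosine_similarity(v1, v2)[0, 0]
print(similarity)
```

Calling fit_transform() on D2 with a second vectorizer instead would build a different vocabulary, which is where the mismatched shapes come from.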
Upvotes: 4