Reputation: 9752
I want to use TfidfVectorizer() on a file that contains many lines, each a phrase. I then want to take a test file with a small subset of phrases, vectorize it with TfidfVectorizer() as well, and compute the cosine similarity between the test file and the original, so that for a given phrase in the test file I retrieve the top N matches within the original file. Here is my attempt:
corpus = tuple(open("original.txt").read().split('\n'))
test = tuple(open("test.txt").read().split('\n'))
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(analyzer='word', ngram_range=(1,3), min_df = 0, stop_words = 'english')
tfidf_matrix = tf.fit_transform(corpus)
tfidf_matrix2 = tf.fit_transform(test)
from sklearn.metrics.pairwise import linear_kernel
def new_find_similar(tfidf_matrix2, index, tfidf_matrix, top_n = 5):
    cosine_similarities = linear_kernel(tfidf_matrix2[index:index+1], tfidf_matrix).flatten()
    related_docs_indices = [i for i in cosine_similarities.argsort()[::-1] if i != index]
    return [(index, cosine_similarities[index]) for index in related_docs_indices][0:top_n]
for index, score in find_similar(tfidf_matrix, 1234567):
    print score, corpus[index]
However I get:
for index, score in new_find_similar(tfidf_matrix2, 1000, tfidf_matrix):
    print score, test[index]
Traceback (most recent call last):
  File "<ipython-input-53-2bf1cd465991>", line 1, in <module>
    for index, score in new_find_similar(tfidf_matrix2, 1000, tfidf_matrix):
  File "<ipython-input-51-da874b8d3076>", line 2, in new_find_similar
    cosine_similarities = linear_kernel(tfidf_matrix2[index:index+1], tfidf_matrix).flatten()
  File "C:\Users\arron\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\metrics\pairwise.py", line 734, in linear_kernel
    X, Y = check_pairwise_arrays(X, Y)
  File "C:\Users\arron\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\metrics\pairwise.py", line 122, in check_pairwise_arrays
    X.shape[1], Y.shape[1]))
ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 66662 while Y.shape[1] == 3332088
I wouldn't mind combining both files and then transforming, but I want to be sure I do not compare any of the phrases from the test file with any of the other phrases within the test file.
Any pointers?
Upvotes: 1
Views: 1354
Reputation: 1314
Fit the TfidfVectorizer with data from corpus, then transform the test data with the already fitted vectorizer (i.e., do not call fit_transform twice):
tfidf_matrix = tf.fit_transform(corpus)
tfidf_matrix2 = tf.transform(test)
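For completeness, here is a minimal end-to-end sketch of how this could fit together with the lookup from the question. The two small lists are made-up stand-ins for the lines of original.txt and test.txt, and `find_top_matches` is a hypothetical helper name, not something from scikit-learn. It assumes the vectorizer's default L2 normalization, under which `linear_kernel` gives the cosine similarity. `min_df` is left at its default here. Because each test row is scored only against rows of the corpus matrix, test phrases are never compared with each other.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy stand-ins for the lines of original.txt and test.txt (hypothetical data).
corpus = [
    "red apple pie recipe",
    "green apple juice",
    "fast red sports car",
    "slow blue family car",
]
test = ["apple pie", "blue car"]

tf = TfidfVectorizer(analyzer="word", ngram_range=(1, 3), stop_words="english")
tfidf_matrix = tf.fit_transform(corpus)  # learns vocabulary and IDF weights from the corpus
tfidf_matrix2 = tf.transform(test)       # reuses that vocabulary, so the column counts match


def find_top_matches(test_matrix, index, corpus_matrix, top_n=5):
    """Return (corpus_index, score) pairs for one test phrase, best first."""
    # Rows are L2-normalized by default, so the dot product is the cosine similarity.
    scores = linear_kernel(test_matrix[index:index + 1], corpus_matrix).flatten()
    best = scores.argsort()[::-1][:top_n]
    return [(int(i), float(scores[i])) for i in best]


for test_index, phrase in enumerate(test):
    matches = find_top_matches(tfidf_matrix2, test_index, tfidf_matrix, top_n=2)
    for corpus_index, score in matches:
        # Indices refer to the corpus, so look the phrase up there, not in test.
        print(phrase, "->", round(score, 3), corpus[corpus_index])
```

Note that the returned indices point into corpus, so the matched text should be read from corpus rather than test, and there is no need to filter out `i != index` since the test phrase is not a row of the corpus matrix.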
Upvotes: 4