Reputation: 1250
I have two CSV files - train and test, with 18000 reviews each. I need to use the train file to do feature extraction and calculate the similarity metric between each review in the train file and each review in the test file.
I generated a vocabulary based on words from the train and test sets. I eliminated stop words but did not correct typos or apply stemming.
The problem I'm facing is that I don't know how to use the output from TfidfVectorizer to generate cosine similarities between the train and test data.
This is the code snippet that fits my train data to the vocabulary:
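from sklearn.feature_extraction.text import TfidfVectorizer
# vocabulary and train_list are built earlier from the CSV files
vect = TfidfVectorizer(sublinear_tf=True, min_df=0.5, vocabulary=vocabulary)
X = vect.fit_transform(train_list)
vocab = vect.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0
# train_matrix = X.todense()
train_idf = vect.idf_
print(vocab)
print(X.todense())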
The output I get from X.todense() is
[[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]]
If I simply print X, it looks like this :
(0, 28137) 0.114440020953
(0, 27547) 0.238913278498
(0, 26519) 0.14777362826
(0, 26297) 0.247716207254
(0, 26118) 0.178776605168
(0, 26032) 0.15139993147
(0, 25771) 0.10334152493
(0, 25559) 0.157584788446
(0, 25542) 0.0909693864147
(0, 25538) 0.179738937276
(0, 21762) 0.112899547719
(0, 21471) 0.159940534946
(0, 21001) 0.0931693893501
(0, 13960) 0.134069984961
(0, 12535) 0.198190713402
(0, 11918) 0.142570540903
: :
(18505, 18173) 0.237810781785
(18505, 17418) 0.233931974117
(18505, 17412) 0.129587180209
(18505, 17017) 0.130917070234
(18505, 17014) 0.137794139419
(18505, 15943) 0.130040669343
(18505, 15837) 0.0790013472346
(18505, 11865) 0.158061557865
(18505, 10896) 0.0708161593204
(18505, 10698) 0.0846731116968
(18505, 10516) 0.116681527108
(18505, 8668) 0.122364898181
(18505, 7956) 0.174450779875
(18505, 1111) 0.191477939381
(18505, 73) 0.257945257626
I don't know how to read the output from X.todense() or print X, and I'm not sure how to find the cosine distance between the test and train sets (probably using pairwise distances? http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html).
Edit:
I repeated the same steps for my test data.
Now I have two sparse matrices, X and Y, of type scipy.sparse.csr.csr_matrix, but since they're both sparse (doc, term) tf-idf matrices, I can't get the cosine similarity between X and Y by direct multiplication.
Converting X and Y with todense() gives me a MemoryError, which means it's inefficient.
What should I do next?
I need some sort of matrix of pairwise cosine similarities with dimensions 18000 x 18000, or a sparse matrix, but I don't know how to do that.
This is for homework and no amount of reading sklearn documentation is helping me at this stage.
Upvotes: 3
Views: 3653
Reputation: 2816
I think you could use pairwise_distances.
Here is an example I am using:
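from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances
# normalize, aux and metrics are defined elsewhere in my code (metrics holds the metric name, e.g. 'cosine')
tf = TfidfVectorizer(tokenizer=normalize, decode_error='ignore', max_features=10000)
tfidf_matrix = tf.fit_transform(aux['enlarged_description'])
# cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
X = pairwise_distances(tfidf_matrix, metric=metrics, n_jobs=-2)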
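For what it's worth, the commented-out linear_kernel line and pairwise_distances with a cosine metric are closely related. Here is a minimal sketch on a made-up three-document corpus (the docs list is hypothetical, not my real data), assuming TfidfVectorizer is left at its default norm='l2' so every row has unit length. In that case the plain dot product is already the cosine similarity, and the cosine distance is 1 minus it:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel, pairwise_distances

# Hypothetical toy corpus, only for illustration.
docs = ["red apple pie", "green apple tart", "blue cheese"]

# With the default norm='l2', each tf-idf row is scaled to unit length.
tfidf_matrix = TfidfVectorizer().fit_transform(docs)

# Dot products of unit-length rows are cosine similarities.
similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

# Cosine distance is defined as 1 - cosine similarity.
distances = pairwise_distances(tfidf_matrix, metric="cosine")

print(np.allclose(distances, 1.0 - similarities))
```

So if you want similarities rather than distances, either use the linear_kernel form or subtract the pairwise_distances result from 1.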
Upvotes: 0
Reputation: 95873
You are almost there. vect.fit_transform returns a sparse representation of a document-term matrix, in this case for your training set. You then need to transform the testing set with the same model. Hint: use the transform method on test_list. You are in luck, because sklearn.metrics.pairwise.pairwise_distances(X, Y) accepts sparse matrices for X and Y when metric='cosine' is passed (i.e. the metric you want). Note that it returns distances, so the cosine similarity is 1 minus the result. It should be pretty straightforward what you need to do from here.
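If the hint is not enough, the steps above can be sketched roughly as follows. This is only an illustration under some assumptions: the two small lists stand in for your real train_list and test_list, and I have swapped in cosine_similarity (from the same sklearn.metrics.pairwise module) with dense_output=False so the result stays sparse and you never have to call todense().

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-ins for the real train_list and test_list.
train_list = ["the food was great", "terrible service and cold food", "great service"]
test_list = ["cold food", "great great service"]

vect = TfidfVectorizer(sublinear_tf=True)
X = vect.fit_transform(train_list)  # learn vocabulary and idf weights from train only
Y = vect.transform(test_list)       # map test onto the same columns, no refitting

# Rows index train documents, columns index test documents.
# dense_output=False keeps the result as a scipy sparse matrix.
sim = cosine_similarity(X, Y, dense_output=False)

print(sim.shape)
```

The key point is that X and Y share the same columns because the test set goes through transform rather than a second fit_transform, so entry (i, j) of sim compares train review i with test review j.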
Upvotes: 2