Reputation: 81
I have a set of documents that describe different dimensions of corporate culture. Tokenized examples below:
sent1=['innovative','culture','fast','moving','company']
sent2=['manager','micromanage','all','time']
sent3=['slow','response','customer']
I've already applied GloVe and Gensim word2vec to the above documents. I'd like to identify documents that have a high cosine similarity score to a set of words, such as
Innovation = ['innovate','innovative','fast']
How do I calculate the cosine similarity between each document (e.g. sent1, sent2) and Innovation using Gensim?
Ideal Output:
innovation
sent1 0.98
sent2 0.45
sent3 -0.2
Upvotes: 0
Views: 775
Reputation: 4209
There are different methods when it comes to "cosine similarity between sets of documents". You can read about some of the solutions here.
But if you just want to calculate the cosine similarity between two word vectors, you can do this (where a and b are your vectors):
from numpy import dot
from numpy.linalg import norm

# cosine similarity = dot product divided by the product of the norms
cos_sim = dot(a, b) / (norm(a) * norm(b))
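To score whole documents against a word set, one common approach is to average the word vectors of each document and of the word set, then take the cosine similarity of the two centroids (Gensim's `KeyedVectors.n_similarity` does essentially this for words known to the model). A minimal sketch with NumPy, using toy 3-d vectors in place of real GloVe/word2vec lookups (the `vectors` dict and its values are made up for illustration):

```python
import numpy as np

def centroid(tokens, vectors):
    """Average the embedding vectors of the tokens found in `vectors`,
    skipping out-of-vocabulary tokens."""
    vecs = [vectors[t] for t in tokens if t in vectors]
    return np.mean(vecs, axis=0)

def cos_sim(a, b):
    """Cosine similarity between two 1-d vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings standing in for model.wv[word] lookups.
vectors = {
    'innovative': np.array([0.9, 0.1, 0.0]),
    'innovate':   np.array([0.8, 0.2, 0.0]),
    'fast':       np.array([0.7, 0.3, 0.1]),
    'manager':    np.array([0.0, 0.9, 0.4]),
    'slow':       np.array([-0.5, 0.2, 0.8]),
}

sent1 = ['innovative', 'fast']
innovation = ['innovate', 'innovative', 'fast']

score = cos_sim(centroid(sent1, vectors), centroid(innovation, vectors))
```

With a trained Gensim model you could replace the toy dict with `model.wv` and, for in-vocabulary words, simply call `model.wv.n_similarity(sent1, innovation)`.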
Upvotes: 0