Yvonne

Reputation: 81

Calculate cosine similarity between documents and a set of keywords (e.g. "innovate", "fast")

I have a set of documents that describe different dimensions of corporate culture. Tokenized examples below:

sent1=['innovative','culture','fast','moving','company']
sent2=['manager','micromanage','all','time']
sent3=['slow','response','customer']

I've already applied GloVe and Gensim word2vec to the above documents. I'd like to identify documents that have a high cosine similarity score to a set of words, such as Innovation = ['innovate','innovative','fast']

How do I calculate the cosine similarities between each document (e.g. sent1, sent2) and Innovation using Gensim?

Ideal Output:

       innovation
sent1  0.98
sent2  0.45
sent3  -0.2

Upvotes: 0

Views: 775

Answers (1)

Peyman

Reputation: 4209

There are different methods when it comes to "cosine similarity between sets of documents". You can read some of the solutions here.

But if you want to calculate the cosine similarity between just two words, you can do this (where a and b are their vectors):

from numpy import dot
from numpy.linalg import norm

cos_sim = dot(a, b)/(norm(a)*norm(b))
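
To extend this from two words to a document and a keyword set, one common approach (not the only one) is to average the word vectors on each side and take the cosine of the two averages. The sketch below assumes that approach. The helper names (mean_vector, cosine, doc_similarity) and the 2-d toy_vectors are made up for illustration; in practice you would pass your trained Gensim KeyedVectors (e.g. model.wv), which supports the same `word in vectors` and `vectors[word]` lookups. Gensim's KeyedVectors also has an n_similarity(ws1, ws2) method built on the same idea, though as far as I know it does not skip out-of-vocabulary words for you.

```python
import numpy as np

def mean_vector(tokens, vectors):
    """Average the vectors of the tokens that have an embedding; None if none do."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    return np.mean(known, axis=0)

def cosine(a, b):
    """Cosine similarity of two 1-d vectors; NaN if either has zero length."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return float("nan")
    return float(np.dot(a, b) / denom)

def doc_similarity(doc_tokens, keyword_tokens, vectors):
    """Cosine similarity between the mean document vector and the mean keyword vector."""
    a = mean_vector(doc_tokens, vectors)
    b = mean_vector(keyword_tokens, vectors)
    if a is None or b is None:
        return float("nan")
    return cosine(a, b)

# Toy 2-d embeddings, invented for this example only.
# Replace with real vectors, e.g. vectors = model.wv for a Gensim Word2Vec model.
toy_vectors = {
    "innovative": np.array([1.0, 0.1]),
    "innovate": np.array([0.9, 0.2]),
    "fast": np.array([0.8, 0.3]),
    "culture": np.array([0.5, 0.5]),
    "slow": np.array([-0.8, 0.2]),
    "customer": np.array([0.0, 1.0]),
}

sent1 = ['innovative', 'culture', 'fast', 'moving', 'company']
sent2 = ['manager', 'micromanage', 'all', 'time']
sent3 = ['slow', 'response', 'customer']
innovation = ['innovate', 'innovative', 'fast']

scores = {
    name: doc_similarity(doc, innovation, toy_vectors)
    for name, doc in [("sent1", sent1), ("sent2", sent2), ("sent3", sent3)]
}
for name, score in scores.items():
    print(name, score)
```

Words missing from the vocabulary are skipped, and a document with no known words gets NaN rather than a misleading score. The scores printed here come from the toy vectors and say nothing about what real embeddings would give.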

Upvotes: 0
