Reputation: 1806

Finding cosine similarity between 2 numbered datasets using Python

I have numeric datasets of length 22, where each value lies between 0 and 1 and represents the percentage of that attribute.

[0.03, 0.15, 0.58, 0.1, 0, 0, 0.05, 0, 0, 0.07, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.01, 0]


[0.9, 0, 0.06, 0.02, 0, 0, 0, 0, 0.02, 0, 0, 0.01, 0, 0, 0, 0, 0.01, 0, 0, 0, 0, 0]


[0.01, 0.07, 0.59, 0.2, 0, 0, 0, 0, 0, 0.05, 0, 0, 0, 0, 0, 0, 0.07, 0, 0, 0, 0, 0]


[0.55, 0.12, 0.26, 0.01, 0, 0, 0, 0.01, 0.02, 0, 0, 0.01, 0, 0, 0.01, 0, 0.01, 0, 0, 0, 0, 0]


[0, 0.46, 0.43, 0.05, 0, 0, 0, 0, 0, 0, 0, 0.02, 0, 0, 0, 0, 0.02, 0.02, 0, 0, 0, 0]

How can I calculate the cosine similarity between two such datasets using Python?

Upvotes: 2

Answers (3)

Rubbal

Reputation: 809

You can use the cosine_similarity function from scikit-learn directly:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
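# cosine_similarity expects 2-D inputs of shape (n_samples, n_features),
# so each vector is passed as a single-row 2-D array
cosine_similarity(np.array([[1, 2, 3]]), np.array([[4, 5, 6]]))[0][0]

Output

0.97463184619707621

Note: the inputs must be 2-D, one row per sample. For an existing 1-D array, x.reshape(1, -1) gives the same single-row shape. If you pass plain 1-D arrays instead, older scikit-learn versions emit the following warning, and, as the message says, later versions raise a ValueError:

DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample

The function returns a 2-D array of pairwise similarities, so index the result with [0][0] to get the final value as a scalar.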

Upvotes: 0

6502

Reputation: 114461

Without depending on numpy you could go with

# dot product divided by the product of the two norms;
# sqrt(|a|^2 * |b|^2) equals |a| * |b|
result = (sum(ax*bx for ax, bx in zip(a, b)) /
          (sum(ax**2 for ax in a) *
           sum(bx**2 for bx in b))**0.5)
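As a self-contained sketch of the dependency-free approach, the dot product and the two norms can be computed separately, which also makes it easy to guard against mismatched lengths and all-zero vectors. The helper name cosine_similarity and the error handling are my own assumptions, not part of the question; a and b are the first two datasets from the question.

```python
import math

def cosine_similarity(a, b):
    """Hypothetical helper: cosine similarity of two equal-length number sequences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # zip pairs up corresponding elements of the two sequences
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # the cosine is undefined when either vector has zero length
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

a = [0.03, 0.15, 0.58, 0.1, 0, 0, 0.05, 0, 0, 0.07, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.01, 0]
b = [0.9, 0, 0.06, 0.02, 0, 0, 0, 0, 0.02, 0, 0, 0.01, 0, 0, 0, 0, 0.01, 0, 0, 0, 0, 0]

print(cosine_similarity(a, b))
```

Raising on a zero vector is only one possible choice; returning 0.0 would be another reasonable convention, depending on what the caller expects.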

Upvotes: 1

Falko

Reputation: 17867

By definition, cosine similarity is the normalized dot product of the two vectors a and b, that is, their dot product divided by the product of their norms:

import numpy as np

a = [0.03, 0.15, 0.58, 0.1, 0, 0, 0.05, 0, 0, 0.07, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.01, 0]
b = [0.9, 0, 0.06, 0.02, 0, 0, 0, 0, 0.02, 0, 0, 0.01, 0, 0, 0, 0, 0.01, 0, 0, 0, 0, 0]

print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

Output (to 12 significant digits):

0.115081383219
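If you want the similarity between every pair of the five datasets from the question rather than just two, one possible sketch is to stack them as rows, scale each row to unit length, and take a matrix product. The variable names data, unit and similarity are my own, and this assumes no row is all zeros (otherwise the division would produce NaN).

```python
import numpy as np

# the five datasets from the question, one per row
data = np.array([
    [0.03, 0.15, 0.58, 0.1, 0, 0, 0.05, 0, 0, 0.07, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.01, 0],
    [0.9, 0, 0.06, 0.02, 0, 0, 0, 0, 0.02, 0, 0, 0.01, 0, 0, 0, 0, 0.01, 0, 0, 0, 0, 0],
    [0.01, 0.07, 0.59, 0.2, 0, 0, 0, 0, 0, 0.05, 0, 0, 0, 0, 0, 0, 0.07, 0, 0, 0, 0, 0],
    [0.55, 0.12, 0.26, 0.01, 0, 0, 0, 0.01, 0.02, 0, 0, 0.01, 0, 0, 0.01, 0, 0.01, 0, 0, 0, 0, 0],
    [0, 0.46, 0.43, 0.05, 0, 0, 0, 0, 0, 0, 0, 0.02, 0, 0, 0, 0, 0.02, 0.02, 0, 0, 0, 0],
])

# Euclidean norm of each row, kept as a column so it broadcasts over the row
norms = np.linalg.norm(data, axis=1, keepdims=True)
unit = data / norms

# entry [i, j] is the cosine similarity between dataset i and dataset j
similarity = unit @ unit.T

print(np.round(similarity, 3))
```

The result is a symmetric 5x5 matrix with ones on the diagonal, and similarity[0, 1] is the value computed above for a and b.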

Upvotes: 4
