Keithx

Reputation: 3158

Calculating cosine distance between the rows of matrix

I'm trying to calculate the cosine distance in Python between the rows of a matrix and have a couple of questions. I'm creating a matrix matr, populating it from lists, and then reshaping it for analysis:

s = []

for i in range(len(a)):
    for j in range(len(b_list)):
        s.append(a[i].count(b_list[j]))

matr = np.array(s) 
d = matr.reshape((22, 254)) 

Printing d gives me something like:

array([[0, 0, 0, ..., 0, 0, 0],
       [2, 0, 0, ..., 1, 0, 0],
       [2, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]])

Then I want to use scipy.spatial.distance.cosine to calculate the cosine distance from the first row to every other row in the d matrix. How can I do that? Should I use a for loop? I don't have much experience with matrix and array operations.

So how can I loop over the second argument (d[1], d[2], and so on) in the call below, instead of running it by hand for each row:

from scipy.spatial.distance import cosine
x = cosine(d[0], d[6])

Upvotes: 3

Views: 16572

Answers (3)

Chris Mueller

Reputation: 6690

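You can just use a simple for loop with scipy.spatial.distance.cosine, iterating over the rows of the reshaped 2-D array d (matr in the question is still flat):

import scipy.spatial.distance

# distance from the first row to every row of d (including itself)
dists = []
for row in d:
    dists.append(scipy.spatial.distance.cosine(d[0, :], row))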
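The same first-row-to-every-row calculation can also be written without a Python loop. The following is only a sketch using plain NumPy; the function name cosine_to_first_row and the demo array are made up for illustration. It also marks rows that are entirely zero as NaN, since a zero vector has no direction and its cosine distance is undefined (the matrix in the question appears to contain such rows).

```python
import numpy as np

def cosine_to_first_row(d):
    """Cosine distance from row 0 of d to every row of d (NaN where undefined)."""
    d = np.asarray(d, dtype=float)
    norms = np.linalg.norm(d, axis=1)   # Euclidean norm of each row
    dots = d @ d[0]                     # dot product of each row with row 0
    denom = norms * norms[0]
    # An all-zero row makes the denominator zero; report NaN for it
    # instead of emitting a division warning.
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(denom > 0, 1.0 - dots / denom, np.nan)

# Hypothetical demo data: row 1 is parallel to row 0, row 2 is
# orthogonal to it, and row 3 is all zeros.
demo = np.array([[1, 2, 0, 1],
                 [2, 4, 0, 2],
                 [0, 0, 3, 0],
                 [0, 0, 0, 0]])
dists = cosine_to_first_row(demo)
```

Here the parallel row comes out at a distance of approximately 0, the orthogonal row at 1, and the all-zero row as NaN.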

Upvotes: 6

Tasos Papastylianou

Reputation: 22245

Here's how you might calculate it easily by hand:

from numpy import array as a
from numpy.random import randint as randi
from numpy.linalg import norm
from numpy import set_printoptions

M = randi(1, 11, size=(5, 5))   # create demo matrix of integers in 1..10

# dot products of every row against every row
DotProducts = M.dot(M.T)

# outer (Kronecker) product of row norms: entry (i, j) is |row i| * |row j|
NormKronecker = a([norm(M, axis=1)]) * a([norm(M, axis=1)]).T

CosineSimilarity = DotProducts / NormKronecker
CosineDistance = 1 - CosineSimilarity

set_printoptions(precision=2, suppress=True)
print(CosineDistance)

Output (from one random run):

[[-0.    0.15  0.1   0.11  0.22]
 [ 0.15  0.    0.15  0.13  0.06]
 [ 0.1   0.15  0.    0.15  0.14]
 [ 0.11  0.13  0.15  0.    0.18]
 [ 0.22  0.06  0.14  0.18 -0.  ]]

The matrix is symmetric and is read entry by entry. For example, counting rows from 1, the cosine distance between row 3 and row 2 (or, equally, between row 2 and row 3) is 0.15.

Upvotes: 2

Warren Weckesser

Reputation: 114921

You said "calculate cosine from first row to every other else in the d matrix" [sic]. If I understand correctly, you can do that with scipy.spatial.distance.cdist, passing the first row as the first argument and the remaining rows as the second argument:

In [31]: from scipy.spatial.distance import cdist

In [32]: matr = np.random.randint(0, 3, size=(6, 8))

In [33]: matr
Out[33]: 
array([[1, 2, 0, 1, 0, 0, 0, 1],
       [0, 0, 2, 2, 1, 0, 1, 1],
       [2, 0, 2, 1, 1, 2, 0, 2],
       [2, 2, 2, 2, 0, 0, 1, 2],
       [0, 2, 0, 2, 1, 0, 0, 0],
       [0, 0, 0, 1, 2, 2, 2, 2]])

In [34]: cdist(matr[0:1], matr[1:], metric='cosine')
Out[34]: array([[ 0.65811827,  0.5545646 ,  0.1752139 ,  0.24407105,  0.72499045]])

If it turns out that you want to compute all the pairwise distances in matr, you can use scipy.spatial.distance.pdist.

For example,

In [35]: from scipy.spatial.distance import pdist

In [36]: pdist(matr, metric='cosine')
Out[36]: 
array([ 0.65811827,  0.5545646 ,  0.1752139 ,  0.24407105,  0.72499045,
        0.36039785,  0.27625314,  0.49748109,  0.41498206,  0.2799177 ,
        0.76429774,  0.37117185,  0.41808563,  0.5765951 ,  0.67661917])

Note that the first five values returned by pdist are the same values returned above using cdist.

For further explanation of the return value of pdist, see "How does condensed distance matrix work? (pdist)".
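As a rough illustration of how the condensed result maps back to row pairs, the sketch below rebuilds the same matr, expands the pdist output into a full square matrix with scipy.spatial.distance.squareform, and looks up one pair directly in the condensed vector. The helper condensed_index is a hypothetical name introduced here, not part of SciPy.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Same 6x8 matrix as in the session above.
matr = np.array([[1, 2, 0, 1, 0, 0, 0, 1],
                 [0, 0, 2, 2, 1, 0, 1, 1],
                 [2, 0, 2, 1, 1, 2, 0, 2],
                 [2, 2, 2, 2, 0, 0, 1, 2],
                 [0, 2, 0, 2, 1, 0, 0, 0],
                 [0, 0, 0, 1, 2, 2, 2, 2]])

condensed = pdist(matr, metric='cosine')  # n*(n-1)/2 = 15 values
square = squareform(condensed)            # symmetric 6x6 matrix, zero diagonal

def condensed_index(i, j, n):
    """Position of the (i, j) distance in the condensed vector, for i < j < n."""
    return n * i - i * (i + 1) // 2 + (j - i - 1)

n = matr.shape[0]
k = condensed_index(1, 2, n)              # pair (row 1, row 2), 0-based
pair_distance = condensed[k]              # same value as square[1, 2]
```

square[0, 1:] holds the distances from the first row to all other rows, which are the same values as the cdist call above, and square[i, j] can be indexed like any ordinary distance matrix.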

Upvotes: 11
