Reputation: 3158
I'm trying to calculate the cosine distance in Python between the rows of a matrix and have a couple of questions. I'm creating a matrix matr, populating it from lists, and then reshaping it for analysis:
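import numpy as np

# a and b_list are the input lists described above
s = []
for i in range(len(a)):
    for j in range(len(b_list)):
        # count how often the j-th item of b_list occurs in a[i]
        s.append(a[i].count(b_list[j]))
matr = np.array(s)
d = matr.reshape((22, 254))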
The output of d looks something like this:
array([[0, 0, 0, ..., 0, 0, 0],
       [2, 0, 0, ..., 1, 0, 0],
       [2, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]])
Then I want to use scipy.spatial.distance.cosine to calculate the cosine distance from the first row to every other row in the d matrix. How can I do that? Do I need a for loop? I don't have much experience with matrix and array operations.
So how can I loop over the second argument (d[1], d[2], and so on) in the call below, instead of running it by hand each time?
from scipy.spatial.distance import cosine
x = cosine(d[0], d[6])
Upvotes: 3
Views: 16572
Reputation: 6690
You can just use a simple for loop with scipy.spatial.distance.cosine:
import scipy.spatial.distance

dists = []
for row in matr:
    dists.append(scipy.spatial.distance.cosine(matr[0, :], row))
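A self-contained variant of that loop, as a sketch: the small array below is made-up demo data standing in for the asker's d, and the loop starts at the second row so that the first row is not compared with itself.

```python
import numpy as np
from scipy.spatial.distance import cosine

# Demo data (an assumption; substitute your own 2-D array of counts).
d = np.array([[1, 2, 0, 1, 0, 0, 0, 1],
              [0, 0, 2, 2, 1, 0, 1, 1],
              [2, 0, 2, 1, 1, 2, 0, 2]])

# Cosine distance from the first row to each of the remaining rows.
dists = [cosine(d[0], row) for row in d[1:]]
print(dists)
```

Note that the cosine distance is not mathematically defined for a row of all zeros (its norm is zero), so such rows may need to be handled separately.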
Upvotes: 6
Reputation: 22245
Here's how you might calculate it easily by hand:
import numpy as np
from numpy.linalg import norm

M = np.random.randint(1, 11, size=(5, 5))  # demo matrix, integers 1..10

# dot products of every pair of rows
DotProducts = M.dot(M.T)

# outer product of the row norms
RowNorms = norm(M, axis=1)
NormOuter = np.outer(RowNorms, RowNorms)

CosineSimilarity = DotProducts / NormOuter
CosineDistance = 1 - CosineSimilarity

np.set_printoptions(precision=2, suppress=True)
print(CosineDistance)
Example output (the demo matrix is random, so the values differ from run to run):
[[-0.    0.15  0.1   0.11  0.22]
 [ 0.15  0.    0.15  0.13  0.06]
 [ 0.1   0.15  0.    0.15  0.14]
 [ 0.11  0.13  0.15  0.    0.18]
 [ 0.22  0.06  0.14  0.18 -0.  ]]
Read this matrix as, for example, "the cosine distance between row 3 and row 2 (or, equally, between row 2 and row 3) is 0.15", counting rows from 1.
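As a sanity check, the by-hand formula can be compared against SciPy's cdist. This is only a sketch: the seed and the 5x5 demo matrix are arbitrary choices, and the values are drawn from 1..10 so that no row is all zeros.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)         # fixed seed, chosen arbitrarily
M = rng.integers(1, 11, size=(5, 5))   # demo matrix, integers 1..10

# By-hand cosine distance: 1 - (row_i . row_j) / (|row_i| * |row_j|)
norms = np.linalg.norm(M, axis=1)
manual = 1 - (M @ M.T) / np.outer(norms, norms)

# Same quantity computed by SciPy for every pair of rows.
reference = cdist(M, M, metric='cosine')

print(np.allclose(manual, reference))
```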
Upvotes: 2
Reputation: 114921
You said "calculate cosine from first row to every other else in the d matrix" [sic]. If I understand correctly, you can do that with scipy.spatial.distance.cdist, passing the first row as the first argument and the remaining rows as the second argument:
In [31]: from scipy.spatial.distance import cdist
In [32]: matr = np.random.randint(0, 3, size=(6, 8))
In [33]: matr
Out[33]:
array([[1, 2, 0, 1, 0, 0, 0, 1],
       [0, 0, 2, 2, 1, 0, 1, 1],
       [2, 0, 2, 1, 1, 2, 0, 2],
       [2, 2, 2, 2, 0, 0, 1, 2],
       [0, 2, 0, 2, 1, 0, 0, 0],
       [0, 0, 0, 1, 2, 2, 2, 2]])
In [34]: cdist(matr[0:1], matr[1:], metric='cosine')
Out[34]: array([[ 0.65811827, 0.5545646 , 0.1752139 , 0.24407105, 0.72499045]])
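Since cdist always returns a 2-D array (here with a single row), a flat vector of distances can be obtained by indexing the first row of the result. A minimal sketch, reusing the demo matrix from the session above:

```python
import numpy as np
from scipy.spatial.distance import cdist

matr = np.array([[1, 2, 0, 1, 0, 0, 0, 1],
                 [0, 0, 2, 2, 1, 0, 1, 1],
                 [2, 0, 2, 1, 1, 2, 0, 2],
                 [2, 2, 2, 2, 0, 0, 1, 2],
                 [0, 2, 0, 2, 1, 0, 0, 0],
                 [0, 0, 0, 1, 2, 2, 2, 2]])

# matr[0:1] keeps the first row 2-D, as cdist requires; the result has
# shape (1, 5), so [0] flattens it to one distance per remaining row.
first_to_rest = cdist(matr[0:1], matr[1:], metric='cosine')[0]
print(first_to_rest.shape)
```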
If it turns out that you want to compute all the pairwise distances in matr, you can use scipy.spatial.distance.pdist. For example,
In [35]: from scipy.spatial.distance import pdist
In [36]: pdist(matr, metric='cosine')
Out[36]:
array([ 0.65811827,  0.5545646 ,  0.1752139 ,  0.24407105,  0.72499045,
        0.36039785,  0.27625314,  0.49748109,  0.41498206,  0.2799177 ,
        0.76429774,  0.37117185,  0.41808563,  0.5765951 ,  0.67661917])
Note that the first five values returned by pdist are the same values returned above using cdist.
For further explanation of the return value of pdist, see How does condensed distance matrix work? (pdist)
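If a full square matrix is easier to work with than the condensed vector, scipy.spatial.distance.squareform converts between the two forms. A short sketch using the same demo matrix as above:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

matr = np.array([[1, 2, 0, 1, 0, 0, 0, 1],
                 [0, 0, 2, 2, 1, 0, 1, 1],
                 [2, 0, 2, 1, 1, 2, 0, 2],
                 [2, 2, 2, 2, 0, 0, 1, 2],
                 [0, 2, 0, 2, 1, 0, 0, 0],
                 [0, 0, 0, 1, 2, 2, 2, 2]])

condensed = pdist(matr, metric='cosine')  # one entry per pair of rows (15 here)
square = squareform(condensed)            # symmetric 6x6, zeros on the diagonal

# square[i, j] is the cosine distance between rows i and j (0-based),
# so square[0, 1:] holds the distances from the first row to the others.
print(square[0, 1:])
```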
Upvotes: 11