Reputation: 1043
I want to compute the cosine distance between each pair of rows in a pandas DataFrame. Before computing the distance, I want to keep only the elements that are > 0 in both vectors (i.e. the positions where both rows have values). For example, given row1 [0,1,45,0,0] and row2 [4,11,2,0,0], the program should compute the cosine distance only between [1,45] and [11,2]. Here is my script, but it takes a long time to complete. Any help with simplifying the script and reducing the processing time is appreciated.
import numpy as np
from scipy.spatial.distance import cosine

data = df.values
m, k = data.shape
dist = np.zeros((m, m))
for i in range(m):
    for j in range(i, m):
        if i != j:
            vec1 = data[i, :]
            vec2 = data[j, :]
            # keep only positions where both rows are > 0
            pairs = [(x, y) for (x, y) in zip(vec1, vec2) if x > 0 and y > 0]
            if pairs:
                sub_list_1, sub_list_2 = map(list, zip(*pairs))
                dist[i][j] = dist[j][i] = cosine(sub_list_1, sub_list_2)
            else:
                # no overlapping positive elements: maximum distance
                dist[i][j] = dist[j][i] = 1
        else:
            dist[i][j] = 0
Upvotes: 1
Views: 1702
Reputation: 221624
From the cosine docs we have the following info -
scipy.spatial.distance.cosine(u, v) : Computes the Cosine distance between 1-D arrays.
The Cosine distance between u and v is defined as
$$1 - \frac{u \cdot v}{\|u\|_2 \, \|v\|_2}$$
where u⋅v is the dot product of u and v.
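A minimal sketch of this definition in plain NumPy, applied to the filtered vectors from the question's example. `cosine_distance` is a hypothetical helper name used only for illustration; it is not SciPy's function.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance 1 - (u . v) / (||u|| * ||v||) for two 1-D arrays."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Filtered vectors from the question's example rows
print(cosine_distance([1, 45], [11, 2]))
```

Parallel vectors such as [1, 2] and [2, 4] give a distance of (numerically) zero.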
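Using the above formula, we get a vectorized solution based on matrix multiplication and NumPy's broadcasting, like so. Note that it computes the distance over complete rows and does not filter out non-positive elements, so it matches the loop in the question only when every element is positive -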
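import numpy as np

def self_cosine_vectorized(a):
    # pairwise dot products between all rows
    dots = a.dot(a.T)
    # Euclidean norm of each row
    sqrt_sums = np.sqrt((a**2).sum(1))
    # divide by both norms via broadcasting
    cosine_dists = 1 - (dots/sqrt_sums)/sqrt_sums[:,None]
    np.fill_diagonal(cosine_dists,0)
    return cosine_dists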
Thus, to get dist -
dist = self_cosine_vectorized(df.values)
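For data that does contain zeros or negative values, the pairwise filtering from the question can also be vectorized. The following is a minimal sketch, not part of the original answer: `masked_cosine_distances` is a hypothetical name, and it assumes the input has no NaN or infinite values. For each pair of rows it restricts the dot product and both norms to the positions where both rows are > 0, and it assigns a distance of 1 when there is no such position, as the question's loop does.

```python
import numpy as np

def masked_cosine_distances(a):
    """Pairwise cosine distances using only positions that are > 0 in both rows."""
    a = np.asarray(a, dtype=float)
    mask = (a > 0).astype(float)   # 1.0 where an element is positive
    pos = a * mask                 # non-positive elements zeroed out

    # dots[i, j]: sum of products over positions positive in both rows
    dots = pos @ pos.T
    # sq[i, j]: squared norm of row i restricted to positions where row j is positive
    sq = (pos ** 2) @ mask.T
    denom = np.sqrt(sq * sq.T)

    dist = np.ones_like(dots)      # default: no overlap -> distance 1
    overlap = denom > 0
    dist[overlap] = 1.0 - dots[overlap] / denom[overlap]
    np.fill_diagonal(dist, 0.0)
    return dist

# Example rows from the question, plus one row that overlaps with neither
rows = np.array([[0, 1, 45, 0, 0],
                 [4, 11, 2, 0, 0],
                 [0, 0, 0, 7, 0]])
print(masked_cosine_distances(rows))
```

With a DataFrame this would be called as `masked_cosine_distances(df.values)`. The two matrix products use O(m²) memory for the result, the same as the unmasked version.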
Runtime test and verification
Original approach :
def original_app(data):
    m, k = data.shape
    dist = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            if i != j:
                vec1 = data[i, :]
                vec2 = data[j, :]
                pairs = [(x, y) for (x, y) in zip(vec1, vec2) if x > 0 and y > 0]
                if pairs:
                    sub_list_1, sub_list_2 = map(list, zip(*pairs))
                    dist[i][j] = cosine(sub_list_1, sub_list_2)
                else:
                    # no overlapping positive elements, as in the question
                    dist[i][j] = 1
            else:
                dist[i][j] = 0
    return dist
Timings and verification -
In [203]: data = np.random.rand(100,100)
In [204]: np.allclose(original_app(data), self_cosine_vectorized(data))
Out[204]: True
In [205]: %timeit original_app(data)
1 loops, best of 3: 813 ms per loop
In [206]: %timeit self_cosine_vectorized(data)
10000 loops, best of 3: 101 µs per loop
In [208]: 813000.0/101
Out[208]: 8049.504950495049
Crazy 8000x+ speedup there!
Upvotes: 1