kitchenprinzessin

Reputation: 1043

Iterate pandas rows and compute cosine distance between rows

I want to compute the cosine distance between every pair of rows in a pandas DataFrame. Before computing the distance, I want to keep only the elements that are > 0 in both rows (the intersection). For example, given row1 [0,1,45,0,0] and row2 [4,11,2,0,0], the program should compute the cosine distance only between [1,45] and [11,2]. Here is my script, but it takes a long time to complete. Any help simplifying the script and reducing the processing time is appreciated.

import numpy as np
from scipy.spatial.distance import cosine

data = df.values
m, k = data.shape
dist = np.zeros((m, m))
for i in range(m):
    for j in range(i,m):
        if i!=j:
            vec1 = data[i,:]
            vec2 = data[j,:]
            pairs = [(x, y) for (x, y) in zip(vec1, vec2) if x > 0 and y > 0]
            if pairs:
                sub_list_1, sub_list_2 = map(list, zip(*pairs))
                dist[i][j] = dist[j][i]=cosine(sub_list_1, sub_list_2)
            else:
                dist[i][j]= dist[j][i] =1
        else:
            dist[i][j]=0 

Upvotes: 1

Views: 1702

Answers (1)

Divakar

Reputation: 221624

From the cosine docs we have the following info -

scipy.spatial.distance.cosine(u, v) : Computes the Cosine distance between 1-D arrays.

The Cosine distance between u and v is defined as

$$1 - \frac{u \cdot v}{\|u\|_2 \, \|v\|_2}$$

where u⋅v is the dot product of u and v.
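As a quick sanity check of this definition, the formula can be evaluated directly with NumPy on the filtered vectors from the question, [1,45] and [11,2]. This is only a minimal sketch: the helper name `cosine_distance` is made up for illustration and is not SciPy's function, and it assumes neither vector is all zeros.

```python
import numpy as np

def cosine_distance(u, v):
    # Hypothetical helper: 1 - (u . v) / (||u||_2 * ||v||_2)
    # Assumes both vectors have a non-zero norm.
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Filtered vectors from the question's example
d = cosine_distance([1, 45], [11, 2])
print(round(d, 4))  # about 0.7993
```

The dot product here is 1*11 + 45*2 = 101, and the norms are sqrt(2026) and sqrt(125), which gives the printed value.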

Using the above formula, we would have one vectorized solution using NumPy's broadcasting, like so -

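import numpy as np

def self_cosine_vectorized(a):
    # Pairwise dot products between all rows
    dots = a.dot(a.T)
    # Euclidean (L2) norm of each row
    sqrt_sums = np.sqrt((a**2).sum(1))
    # Divide entry (i, j) by norm[j], then by norm[i], via broadcasting
    cosine_dists = 1 - (dots/sqrt_sums)/sqrt_sums[:,None]
    np.fill_diagonal(cosine_dists,0)
    return cosine_dists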

Thus, to get dist -

dist = self_cosine_vectorized(df.values)  
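If the filtering from the question (use only the positions where both rows are > 0) has to be preserved, one possible vectorized sketch is shown here. It is not part of the approach above: the function name `masked_cosine_distances` is hypothetical, and it assumes the whole m x m result fits in memory. The idea is that zeroing out non-positive entries makes the plain dot product equal to the filtered one, while the per-pair squared norms come from multiplying the squared values by the other row's boolean mask.

```python
import numpy as np

def masked_cosine_distances(a):
    # Hypothetical helper: cosine distance per row pair, restricted to
    # columns where BOTH rows are > 0 (as in the question's loop).
    a = np.asarray(a, dtype=float)
    pos = a > 0                       # boolean mask of positive entries
    p = np.where(pos, a, 0.0)         # non-positive entries set to 0
    dots = p @ p.T                    # dot products over shared columns
    # sq[i, j] = sum of a[i, k]**2 over columns k where row j is also > 0
    sq = (p ** 2) @ pos.T.astype(float)
    denom = np.sqrt(sq * sq.T)        # product of the two filtered norms
    sim = np.zeros_like(dots)
    # Pairs without any shared positive column keep similarity 0
    np.divide(dots, denom, out=sim, where=denom > 0)
    dist = 1.0 - sim                  # so their distance becomes 1
    np.fill_diagonal(dist, 0.0)
    return dist

# Example rows from the question, plus a row with no overlap
rows = np.array([[0, 1, 45, 0, 0],
                 [4, 11, 2, 0, 0],
                 [0, 0, 0, 7, 0]])
result = masked_cosine_distances(rows)
```

For the first two rows this reduces to the cosine distance between [1,45] and [11,2], and pairs with no shared positive column get a distance of 1, matching the else branch of the loop in the question.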

Runtime test and verification

Original approach :

import numpy as np
from scipy.spatial.distance import cosine

def original_app(data):
    m, k = data.shape
    dist = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            if i!=j:
                vec1 = data[i,:]
                vec2 = data[j,:]
                pairs = [(x, y) for (x, y) in zip(vec1, vec2) if x > 0 and y > 0]
                if pairs:
                    sub_list_1, sub_list_2 = map(list, zip(*pairs))
                    dist[i][j] = cosine(sub_list_1, sub_list_2)
                else:
                    dist[i][j] = 1  # no shared positive entries, as in the question
            else:
                dist[i][j]=0 
    return dist

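Timings and verification (np.random.rand draws values from [0, 1), so the > 0 filter in original_app keeps practically every element here, which is why the two results agree) -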

In [203]: data = np.random.rand(100,100)

In [204]: np.allclose(original_app(data), self_cosine_vectorized(data))
Out[204]: True

In [205]: %timeit original_app(data)
1 loops, best of 3: 813 ms per loop

In [206]: %timeit self_cosine_vectorized(data)
10000 loops, best of 3: 101 µs per loop

In [208]: 813000.0/101
Out[208]: 8049.504950495049

Crazy 8000x+ speedup there!

Upvotes: 1
