mfd

Reputation: 58

Modified cosine similarity efficiency

The Problem

I'm trying to compute the cosine similarity between two arrays, but there is a slight alteration to the base formula. Namely, I only care about components that overlap with a "reference" array. For example, if we were to compute the cosine similarity between the following two arrays:

A = [1 0 1]     B = [1 1 0]
    [0 1 1]         [0 1 1]

Let's say B is the reference array. Then A changes with respect to each row in B, to only include components that overlap with that row. E.g., the first row in B is [1 1 0], so the similarity computation is with the modified A matrix:

[1 0 0]
[0 1 0]

To compute the next similarities with [0 1 1], the modified A becomes:

[0 0 1]
[0 1 1]
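In NumPy terms (assuming the arrays above are NumPy arrays), the per-row masking can be reproduced directly by broadcasting each reference row's nonzero pattern against A:

```python
import numpy as np

A = np.array([[1, 0, 1],
              [0, 1, 1]])
B = np.array([[1, 1, 0],
              [0, 1, 1]])

# Mask A by the nonzero pattern of each row of B in turn
masked_by_row0 = A * (B[0] > 0)  # [[1 0 0], [0 1 0]]
masked_by_row1 = A * (B[1] > 0)  # [[0 0 1], [0 1 1]]
print(masked_by_row0)
print(masked_by_row1)
```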

My question is: is there a way to introduce this modification without drastically slowing down performance (when compared to built-in cosine similarity options like sklearn.metrics.pairwise.cosine_similarity)? I understand nothing will be as fast as a standard cosine similarity computation, but right now my attempts to introduce this change have resulted in a slowdown of nearly 100x, so any improvement on that would be great.

Attempts

I don't really know of any way to do this outside of going row-by-row through the reference array, masking the other array according to the current row, and then doing matrix-vector cosine similarity. Something like this:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def modified_cosine_sim(arr1, arr2):
    # arr2 is the reference array
    final_arr = []
    for row in arr2:
        # zero out the components of arr1 where the current reference row is zero
        masked_arr1 = arr1 * np.where(row > 0, 1, 0)
        # sklearn expects 2D inputs, so reshape the row to (1, n_features)
        final_arr.append(cosine_similarity(masked_arr1, row.reshape(1, -1)))

    return final_arr

This is pretty inefficient, though. I checked to see if there was some clever way to modify the sklearn cosine_similarity code to accomplish the goal here, but that code relies on normalizing both arrays before the computation proceeds, and I can't really do that -- arr1 effectively changes throughout the computation, depending on the row in arr2 that's currently being used to compute similarities.

I have to run this computation repeatedly on somewhat large arrays, so any optimization tips would be greatly appreciated. Or if this computation corresponds to some already optimized built-in function I'm unfamiliar with, that would be even better. Thanks!

Upvotes: 0

Views: 518

Answers (2)

Paul Panzer

Reputation: 53029

The following implements your modified formula using mostly matrix multiplication.

import numpy as np

def modified_similarity(a, b):
    # normalize b row-wise, keeping only its nonnegative part
    bc = np.maximum(b, 0) / np.linalg.norm(b, axis=1, keepdims=True)
    # numerator: a against the normalized reference rows;
    # denominator: norm of each row of a restricted to the reference row's support
    return a @ bc.T / np.sqrt(np.square(a) @ np.sign(bc).T)
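As a sanity check (repeating the function here so the snippet is self-contained, and assuming nonnegative inputs as in the question's example), the vectorized version can be compared against a straightforward row-by-row loop:

```python
import numpy as np

def modified_similarity(a, b):
    bc = np.maximum(b, 0) / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ bc.T / np.sqrt(np.square(a) @ np.sign(bc).T)

def loop_reference(a, b):
    # brute force: mask a by each reference row, then cosine similarity
    out = np.empty((len(a), len(b)))
    for j, row in enumerate(b):
        masked = a * (row > 0)
        out[:, j] = masked @ row / (np.linalg.norm(masked, axis=1)
                                    * np.linalg.norm(row))
    return out

rng = np.random.default_rng(0)
a = rng.random((5, 4)) + 0.1          # strictly positive, so no zero norms
b = np.array([[1., 1., 0., 0.],
              [0., 1., 1., 0.],
              [1., 0., 1., 1.]])
print(np.allclose(modified_similarity(a, b), loop_reference(a, b)))
```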

Upvotes: 1

abhilb

Reputation: 5757

I believe the following code and your function modified_cosine_sim are equivalent.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def faster_cosine_sim(arr1, arr2):
    return cosine_similarity(arr1 * np.where(arr2 > 0, 1, 0), arr2)

Upvotes: 0
