Reputation: 58
I'm trying to compute the cosine similarity between two arrays, but there is a slight alteration to the base formula. Namely, I only care about components that overlap with a "reference" array. For example, if we were to compute the cosine similarity between the following two arrays:
A = [1 0 1] B = [1 1 0]
[0 1 1] [0 1 1]
Let's say B
is the reference array. Then A
changes with respect to each row in B, to only include components that overlap with that row. E.g., the first row in B
is [1 1 0]
, so the similarity computation is with the modified A matrix:
[1 0 0]
[0 1 0]
To compute the next similarities with [0 1 1]
, the modified A
becomes:
[0 0 1]
[0 1 1]
My question is: is there a way to introduce this modification without drastically slowing down performance (when compared to built-in cosine similarity options like sklearn.metrics.pairwise.cosine_similarity
)? I understand nothing will be as fast as a standard cosine similarity computation, but right now my attempts to introduce this change have resulted in a slowdown of nearly 100x, so any improvement on that would be great.
I don't really know of any way to do this outside of going row-by-row through the reference array, masking the other array according to the current row, and then doing matrix-vector cosine similarity. Something like this:
def modified_cosine_sim(arr1, arr2):
# arr2 is reference array
final_arr = []
for row in arr2:
masked_arr1 = arr1 * np.where(row > 0, 1, 0)
final_arr.append(cosine_similarity(masked_arr1, row))
return final_arr
This is pretty inefficient, though. I checked to see if there was some clever way to modify the sklearn cosine_similarity
code to accomplish the goal here, but that code relies on normalizing both arrays before the computation proceeds, and I can't really do that -- arr1
effectively changes throughout the computation, depending on the row in arr2
that's currently being used to compute similarities.
I have to run this computation repeatedly on somewhat large arrays, so any optimization tips would be greatly appreciated. Or if this computation corresponds to some already optimized built-in function I'm unfamiliar with, that would be even better. Thanks!
Upvotes: 0
Views: 518
Reputation: 53029
The following implements your modified formula using mostly matrix multiplication.
def modified_similarity(a,b):
bc = np.maximum(b,0)/np.linalg.norm(b,axis=1,keepdims=True)
return [email protected]/np.sqrt(np.square(a)@np.sign(bc).T)
Upvotes: 1
Reputation: 5757
I believe the following code and your function modified_cosine_sim
are equivalent.
def faster_cosine_sim(arr1, arr2):
return cosine_similarity(arr1 * np.where(arr2 > 0, 1, 0), arr2)
Upvotes: 0