sunny_kid

Reputation: 67

Optimal way to calculate cosine similarity between a target string and a list of strings - Python

I have a target string, say target = 'apple', and a list of candidate strings, say candidate_list = ['orange', 'banana', 'apple1', 'pineapple']. I calculate the cosine similarity between target and each string in candidate_list by looping over the list and calling the following function.

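import numpy
from sklearn.feature_extraction.text import CountVectorizer

def calculate_cosine(c, h):
   # Build a shared vocabulary from both strings; [0] takes the single row as a 1-D vector
   vec = CountVectorizer()
   label_dictionary = vec.fit([c, h])
   c_vector = label_dictionary.transform([c]).toarray()[0]
   h_vector = label_dictionary.transform([h]).toarray()[0]

   # Cosine similarity: dot product divided by the product of the two norms
   cx = lambda curr, hist: round(
      numpy.inner(curr, hist) / (numpy.linalg.norm(curr) * numpy.linalg.norm(hist)), 3)

   return cx(c_vector, h_vector)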

My question: is there a way to do this without iterating over candidate_list, along the lines of array broadcasting or a matrix operation? I ask because my current implementation (looping over candidate_list) is not fast enough for my application. Thanks.

Upvotes: 4

Views: 1689

Answers (1)

jakevdp

Reputation: 86320

Scikit-learn contains efficient code for computing the cosine similarity between groups of vectors; it's in the sklearn.metrics.pairwise submodule.

Here's a fast approach for your problem:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import pairwise_kernels

candidate_list = ['orange', 'banana', 'apple1', 'pineapple']
target = 'apple'

vec = CountVectorizer(analyzer='char')
vec.fit(candidate_list)

pairwise_kernels(vec.transform([target]),
                 vec.transform(candidate_list),
                 metric='cosine')
# array([[ 0.3086067 ,  0.30304576,  0.93541435,  0.9166985 ]])

Note that I used CountVectorizer(analyzer='char') to count characters rather than words, because it seemed more appropriate for your example data.
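
For reference, here is a minimal sketch of the same computation written two ways: once with cosine_similarity from sklearn.metrics.pairwise, and once as an explicit matrix operation in NumPy, which is roughly what the question asks about. The variable names (C, t, sims, manual, best) are illustrative only, and the manual version assumes no vector is all zeros (a target sharing no characters with the candidates would cause a division by zero).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

candidate_list = ['orange', 'banana', 'apple1', 'pineapple']
target = 'apple'

# Character counts, fitted on the candidates as in the snippet above
vec = CountVectorizer(analyzer='char')
C = vec.fit_transform(candidate_list)   # sparse matrix, one row per candidate
t = vec.transform([target])             # sparse matrix with a single row

# Library route: one call, no Python-level loop over the candidates
sims = cosine_similarity(t, C).ravel()

# Manual route: all dot products at once, divided by the product of the norms
C_dense = C.toarray().astype(float)
t_dense = t.toarray().astype(float).ravel()
manual = (C_dense @ t_dense) / (
    np.linalg.norm(C_dense, axis=1) * np.linalg.norm(t_dense))

# Candidate with the highest similarity to the target
best = candidate_list[int(np.argmax(sims))]
```

Both routes should give the same numbers; the library call is usually preferable because it works on the sparse matrices directly instead of converting them to dense arrays.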

Upvotes: 4
