Reputation: 67
I have a target string, say target = 'apple', and a list of candidate strings, say candidate_list = ['orange', 'banana', 'apple1', 'pineapple'].
I am calculating the cosine similarity between target and each string in candidate_list, iterating over the list with the following code.
import numpy
from sklearn.feature_extraction.text import CountVectorizer

def calculate_cosine(c, h):
    vec = CountVectorizer()
    label_dictionary = vec.fit([c, h])
    # Flatten to 1-D so numpy.inner returns a scalar
    c_vector = label_dictionary.transform([c]).toarray()[0]
    h_vector = label_dictionary.transform([h]).toarray()[0]
    # Cosine similarity: inner product divided by the product of the norms
    cx = lambda curr, hist: round(
        numpy.inner(curr, hist) / (numpy.linalg.norm(curr) * numpy.linalg.norm(hist)), 3)
    return cx(c_vector, h_vector)
My question is: is there a way to do this without iterating over candidate_list, along the lines of array broadcasting or a matrix operation?
I am asking because my current implementation (looping over candidate_list) is not fast enough for my application.
Thanks.
Upvotes: 4
Views: 1689
Reputation: 86320
Scikit-learn contains efficient code for computing the cosine similarity between groups of vectors; it's in the sklearn.metrics.pairwise
submodule.
Here's a fast approach for your problem:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import pairwise_kernels
candidate_list = ['orange', 'banana', 'apple1', 'pineapple']
target = 'apple'
vec = CountVectorizer(analyzer='char')
vec.fit(candidate_list)
pairwise_kernels(vec.transform([target]),
                 vec.transform(candidate_list),
                 metric='cosine')
# array([[ 0.3086067 , 0.30304576, 0.93541435, 0.9166985 ]])
Note that I used CountVectorizer(analyzer='char')
to count characters rather than words, because it seemed more appropriate for your example data.
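Since the question specifically asked about array broadcasting: for reference, here is a sketch of the same computation in plain NumPy, using one matrix product plus broadcasting over the row norms. This is an equivalent reformulation for illustration, not scikit-learn's internal implementation, and it densifies the sparse matrices, so prefer pairwise_kernels for large vocabularies.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

candidate_list = ['orange', 'banana', 'apple1', 'pineapple']
target = 'apple'

vec = CountVectorizer(analyzer='char')
vec.fit(candidate_list)

# Dense count matrices: t has shape (1, n_features),
# C has shape (n_candidates, n_features).
t = vec.transform([target]).toarray()
C = vec.transform(candidate_list).toarray()

# One matrix product gives all inner products at once, shape (1, n_candidates);
# dividing by the broadcasted product of row norms yields cosine similarities.
sims = (t @ C.T) / (np.linalg.norm(t, axis=1, keepdims=True)
                    * np.linalg.norm(C, axis=1))
print(sims)
# approximately [[0.3086, 0.3030, 0.9354, 0.9167]], matching the answer above
```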
Upvotes: 4