Dan

Reputation: 451

cosine similarity between string and list of strings

I have a string which I'm trying to compare to a list of strings. The objective is to rank the list of strings from most similar to least similar using cosine similarity.

original_string = 'abc'
string_list = ['abc', 'abcd', 'abec', 'ab', 'abcde', 'qwe']

That's my code so far:

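import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# vectorizer settings were not shown in the post; defaults assumed here
tfidf_vectorizer = TfidfVectorizer()
sparse_matrix = tfidf_vectorizer.fit_transform(string_list)
doc_term_matrix = sparse_matrix.toarray()
df = pd.DataFrame(
    doc_term_matrix,
    columns=tfidf_vectorizer.get_feature_names_out(),
    index=string_list,
)

# pairwise similarity between every pair of rows in df
cosine = pd.DataFrame(cosine_similarity(df))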

However, this computes pairwise similarities between the rows of the dataframe, and I'm not sure how to transform the single string into a vector and pass it to the cosine_similarity function as the Y argument.

Upvotes: 2

Views: 5749

Answers (2)

StupidWolf

Reputation: 46898

If I understand you correctly, you need to vectorize the string together with the list. One way is to prepend the test string to the string list and fit the vectorizer on the combined list. Once you have the matrix, it's a matter of computing the cosine similarity of the first row against all other rows:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import numpy as np

string = 'abc'
string_list = ['abc', 'abcd', 'abec', 'ab', 'abcde', 'qwe']

tfidf_vectorizer = TfidfVectorizer(analyzer="char")

sparse_matrix = tfidf_vectorizer.fit_transform([string]+string_list)
cosine = cosine_similarity(sparse_matrix[0,:],sparse_matrix[1:,:])

Now we can put the results in a dataframe:

pd.DataFrame({'cosine':cosine[0],'strings':string_list}).sort_values('cosine',ascending=False)

     cosine strings
0  1.000000     abc
3  0.779625      ab
2  0.771964    abec
1  0.720181    abcd
4  0.619448   abcde
5  0.000000     qwe
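
The same steps can be wrapped in a small helper for reuse. This is only a minimal sketch: the function name `rank_by_cosine` is hypothetical, and it assumes character-level TF-IDF fitted on the query together with the candidates, as in the code above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def rank_by_cosine(query, candidates):
    """Return (candidate, score) pairs sorted from most to least similar.

    Hypothetical helper: the query is vectorized together with the
    candidates using character-level TF-IDF.
    """
    vectorizer = TfidfVectorizer(analyzer="char")
    matrix = vectorizer.fit_transform([query] + list(candidates))
    # Row 0 is the query; the remaining rows are the candidates.
    scores = cosine_similarity(matrix[:1], matrix[1:])[0]
    return sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)


ranked = rank_by_cosine('abc', ['abc', 'abcd', 'abec', 'ab', 'abcde', 'qwe'])
for text, score in ranked:
    print(f"{text}\t{score:.6f}")
```

Because the vectorizer is fitted on the query as well, the IDF weights depend on the query string; whether that is desirable depends on the use case.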

Upvotes: 2

piterbarg

Reputation: 8219

I am not sure what you need DataFrames for, but the following code calculates the cosine similarity of your string (which I renamed to tgt_string below, as string is a bit too generic a name) against each element of string_list:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tgt_string = 'abc'
string_list = ['abc', 'abcd', 'abec', 'ab', 'abcde', 'qwe']

tfidf_vectorizer = TfidfVectorizer()
sparse_matrix = tfidf_vectorizer.fit_transform(string_list)
doc_term_matrix = sparse_matrix.toarray()

tgt_transform = tfidf_vectorizer.transform([tgt_string]).toarray()
tgt_cosine = cosine_similarity(doc_term_matrix,tgt_transform)
print(tgt_cosine)

This produces:

[[1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]]

which are the cosine similarities of tgt_string against each element of string_list. If you want them printed neatly in a dataframe, you can additionally do

import pandas as pd
pd.DataFrame(index=string_list, data=tgt_cosine, columns=[tgt_string])

to generate


        abc
abc     1.0
abcd    0.0
abec    0.0
ab      0.0
abcde   0.0
qwe     0.0
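
Note that a default `TfidfVectorizer()` tokenizes on words, so each string here is a single token and only an exact match scores above zero. If graded scores are wanted, one option is to keep the fit/transform split and switch to character-level features. A minimal sketch, assuming `analyzer="char"` suits the data (the names `char_vectorizer`, `scores` and `order` are illustrative only):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tgt_string = 'abc'
string_list = ['abc', 'abcd', 'abec', 'ab', 'abcde', 'qwe']

# Character-level features, so strings sharing letters get a non-zero score.
char_vectorizer = TfidfVectorizer(analyzer="char")
list_matrix = char_vectorizer.fit_transform(string_list)  # fit on the list only
tgt_vector = char_vectorizer.transform([tgt_string])      # reuse the fitted vocabulary

# Result has shape (1, len(string_list)); take the single row of scores.
scores = cosine_similarity(tgt_vector, list_matrix)[0]

# Indices of string_list ordered from most to least similar.
order = np.argsort(-scores)
for i in order:
    print(string_list[i], round(float(scores[i]), 6))
```

Since the vectorizer is fitted on the list alone, any character in the target string that never occurs in the list is ignored by `transform`.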

Upvotes: 1
