Reputation: 451
I have a string which I'm trying to compare to a list of strings. The objective is to rank the list of strings from most similar to least similar using cosine similarity.
original_string = 'abc'
string_list = ['abc', 'abcd', 'abec', 'ab', 'abcde', 'qwe']
That's my code so far:
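from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
tfidf_vectorizer = TfidfVectorizer()
sparse_matrix = tfidf_vectorizer.fit_transform(string_list)
doc_term_matrix = sparse_matrix.toarray()
# One row per string, one column per TF-IDF feature
df = pd.DataFrame(
    doc_term_matrix,
    columns=tfidf_vectorizer.get_feature_names_out(),
    index=string_list,
)
# Pairwise similarity between every pair of rows
cosine = pd.DataFrame(cosine_similarity(df))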
However, this computes pairwise similarities between the elements of the dataframe, and I'm not sure how to transform the single string into a vector and pass it to the cosine_similarity function as Y.
Upvotes: 2
Views: 5749
Reputation: 46898
I hope I understand you correctly: you need to vectorize the string together with the list. One way is to prepend the test string to the string list before fitting. Once you have the matrix, it's a matter of computing the cosine similarity of the first row against all the other rows:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import numpy as np
string = 'abc'
string_list = ['abc', 'abcd', 'abec', 'ab', 'abcde', 'qwe']
tfidf_vectorizer = TfidfVectorizer(analyzer="char")
sparse_matrix = tfidf_vectorizer.fit_transform([string]+string_list)
cosine = cosine_similarity(sparse_matrix[0,:],sparse_matrix[1:,:])
Now we can put the results in a dataframe:
pd.DataFrame({'cosine':cosine[0],'strings':string_list}).sort_values('cosine',ascending=False)
cosine strings
0 1.000000 abc
3 0.779625 ab
2 0.771964 abec
1 0.720181 abcd
4 0.619448 abcde
5 0.000000 qwe
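The same steps can be wrapped in a small helper. This is only a sketch: the name rank_by_cosine is made up for illustration, and it assumes character-level TF-IDF fitted on the query plus the candidates, as above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def rank_by_cosine(query, candidates):
    """Return (candidate, similarity) pairs, most similar first.

    Hypothetical helper, not part of scikit-learn.
    """
    vectorizer = TfidfVectorizer(analyzer="char")
    # Fit on the query plus the candidates so they share one vocabulary.
    matrix = vectorizer.fit_transform([query] + list(candidates))
    # Row 0 is the query; the remaining rows are the candidates.
    sims = cosine_similarity(matrix[0:1], matrix[1:])[0]
    return sorted(zip(candidates, sims), key=lambda pair: pair[1], reverse=True)


ranked = rank_by_cosine("abc", ["abc", "abcd", "abec", "ab", "abcde", "qwe"])
for text, score in ranked:
    print(f"{text}: {score:.6f}")
```

Because the query is part of the fitted corpus, it also influences the IDF weights, so the scores can differ slightly from fitting on the list alone and then calling transform on the query.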
Upvotes: 2
Reputation: 8219
I am not sure what you need DataFrames for, but the following code calculates the cosine similarity of your string (which I renamed to tgt_string below, as string is a bit too generic a name) against each element of string_list:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
tgt_string = 'abc'
string_list = ['abc', 'abcd', 'abec', 'ab', 'abcde', 'qwe']
tfidf_vectorizer = TfidfVectorizer()
sparse_matrix = tfidf_vectorizer.fit_transform(string_list)
doc_term_matrix = sparse_matrix.toarray()
tgt_transform = tfidf_vectorizer.transform([tgt_string]).toarray()
tgt_cosine = cosine_similarity(doc_term_matrix,tgt_transform)
print(tgt_cosine)
This produces:
[[1.]
[0.]
[0.]
[0.]
[0.]
[0.]]
which are the cosine similarities of tgt_string against each element of the list string_list. If you want it printed prettily in a dataframe, you can additionally do
import pandas as pd
pd.DataFrame(index=string_list, data=tgt_cosine, columns=[tgt_string])
to generate
abc
abc 1.0
abcd 0.0
abec 0.0
ab 0.0
abcde 0.0
qwe 0.0
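The all-or-nothing scores come from TfidfVectorizer's default word-level analyzer: each string here is a single token, so only an exact match shares a feature with tgt_string. As a rough sketch (the variable names word_sims and char_sims are just for illustration), the same fit-then-transform approach can be compared against analyzer="char", which scores partial overlap:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tgt_string = "abc"
string_list = ["abc", "abcd", "abec", "ab", "abcde", "qwe"]

# Default analyzer="word": every string is one token, so the vocabulary
# is just the six whole strings.
word_vec = TfidfVectorizer()
word_matrix = word_vec.fit_transform(string_list)
word_sims = cosine_similarity(word_matrix, word_vec.transform([tgt_string])).ravel()

# analyzer="char": every character is a feature, so strings that share
# some characters with tgt_string get a non-zero score.
char_vec = TfidfVectorizer(analyzer="char")
char_matrix = char_vec.fit_transform(string_list)
char_sims = cosine_similarity(char_matrix, char_vec.transform([tgt_string])).ravel()

print(sorted(word_vec.vocabulary_))  # whole strings as features
print(sorted(char_vec.vocabulary_))  # single characters as features
for text, w, c in zip(string_list, word_sims, char_sims):
    print(f"{text}: word={w:.3f} char={c:.3f}")
```

Here the vectorizer is fitted on string_list only and the target is passed through transform, so the target does not affect the IDF weights.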
Upvotes: 1