rafine

Reputation: 471

Euclidean distance from a word to a sentence after applying TfidfVectorizer

I have a DataFrame with 1000 rows of text.

I applied TfidfVectorizer to it.

Now I want to create a new column, e.g. df['king'], that gives the distance from each sentence to a word of my choice, say "king".

I thought about taking, in each sentence, the 5 words closest to "king" and averaging them.

I would be glad to know how to do that, or to hear about another method.

Upvotes: 1

Views: 43

Answers (1)

I am not convinced that the Euclidean distance would be the optimal measure. I would actually look at similarity scores:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

data = {
    'text': [
        "The king sat on the throne with wisdom.",
        "A queen ruled the kingdom alongside the king.",
        "Knights were loyal to their king.",
        "The empire prospered under the rule of a wise monarch."
    ]
}
df = pd.DataFrame(data)

tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(df['text'])

# transform() does not raise for unseen words; it silently returns an
# all-zero vector, so check the fitted vocabulary explicitly.
if "king" not in tfidf.vocabulary_:
    print("The word 'king' is not in the vocabulary.")
king_vector = tfidf.transform(["king"]).toarray()

similarities = cosine_similarity(tfidf_matrix, king_vector).flatten()

feature_names = np.array(tfidf.get_feature_names_out())

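def get_top_n_words(row_vector, top_n=5):
    # Words with the top_n largest TF-IDF weights in this sentence
    # (highest weight, not "closest to king")
    indices = row_vector.argsort()[::-1][:top_n]
    return feature_names[indices]

# For each sentence, average the cosine similarity between "king" and
# each of the sentence's top-weighted words
averages = []
for i in range(tfidf_matrix.shape[0]):
    sentence_vector = tfidf_matrix[i].toarray().flatten()
    top_words = get_top_n_words(sentence_vector)
    top_similarities = [cosine_similarity(tfidf.transform([word]), king_vector).flatten()[0] for word in top_words]
    averages.append(np.mean(top_similarities))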

df['king_similarity'] = similarities
df['avg_closest_similarity'] = averages

print(df)

which would give you

                                                text  king_similarity  \
0            The king sat on the throne with wisdom.         0.240614   
1      A queen ruled the kingdom alongside the king.         0.259779   
2                  Knights were loyal to their king.         0.274487   
3  The empire prospered under the rule of a wise ...         0.000000   

   avg_closest_similarity  
0                     0.0  
1                     0.0  
2                     0.0  
3                     0.0  
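
A note on why avg_closest_similarity is 0.0 in every row: in a bag-of-words TF-IDF space each vocabulary word gets its own dimension, so the vector for a single word is orthogonal to the vector for any other word. The average can only be non-zero when "king" itself is among a sentence's top-weighted words, which does not appear to happen in this toy data because "king" occurs in three of the four sentences and so gets a low IDF weight. The sketch below illustrates the orthogonality with plain Python; the four-word vocabulary and the helper names are made up for illustration and are not part of scikit-learn.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical mini-vocabulary: each word owns exactly one dimension,
# which is what a single-word TF-IDF vector looks like after L2 normalization.
vocab = ["king", "queen", "throne", "wisdom"]

def one_word_vector(word):
    return [1.0 if w == word else 0.0 for w in vocab]

king = one_word_vector("king")
sims = {w: cosine(one_word_vector(w), king) for w in vocab}
print(sims)  # {'king': 1.0, 'queen': 0.0, 'throne': 0.0, 'wisdom': 0.0}
```

So with TF-IDF alone, "the 5 closest words to king" cannot be ranked in a meaningful way; the king_similarity column is the informative one here.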

That being said, if you absolutely want to focus on Euclidean distance, here is a method:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from scipy.spatial.distance import euclidean

data = {
    'text': [
        "The king sat on the throne with wisdom.",
        "A queen ruled the kingdom alongside the king.",
        "Knights were loyal to their king.",
        "The empire prospered under the rule of a wise monarch."
    ]
}
df = pd.DataFrame(data)

tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(df['text']).toarray()

feature_names = tfidf.get_feature_names_out()
if "king" in feature_names:
    king_index = np.where(feature_names == "king")[0][0]
    king_vector = np.zeros_like(tfidf_matrix[0])
    king_vector[king_index] = 1
else:
    print("The word 'king' is not in the vocabulary.")
    king_vector = np.zeros_like(tfidf_matrix[0])

df['king_distance'] = [euclidean(sentence_vector, king_vector) for sentence_vector in tfidf_matrix]

print(df)

which gives

                                                text  king_distance
0            The king sat on the throne with wisdom.       1.232385
1      A queen ruled the kingdom alongside the king.       1.216734
2                  Knights were loyal to their king.       1.204586
3  The empire prospered under the rule of a wise ...       1.414214
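
It may help to see that the two measures carry the same information in this setup. TfidfVectorizer L2-normalizes each row by default, and the one-hot "king" vector also has length 1, so for unit vectors the identity distance = sqrt(2 - 2 * similarity) should hold. The sketch below only checks that identity against the numbers printed in the two outputs above (copied by hand, so rounded to six decimals); it does not rerun the vectorizer.

```python
import math

# Values copied from the king_similarity and king_distance columns above
king_similarity = [0.240614, 0.259779, 0.274487, 0.0]
king_distance = [1.232385, 1.216734, 1.204586, 1.414214]

# For two unit-length vectors u and v:
#   ||u - v||^2 = ||u||^2 + ||v||^2 - 2 * (u . v) = 2 - 2 * cos(u, v)
predicted = [math.sqrt(2 - 2 * sim) for sim in king_similarity]

for sim, dist, pred in zip(king_similarity, king_distance, predicted):
    print(f"similarity={sim:.6f}  distance={dist:.6f}  sqrt(2 - 2*sim)={pred:.6f}")
```

Because one is a monotonic function of the other, ranking sentences by Euclidean distance or by cosine similarity gives the same order here; a sentence without "king" always sits at the maximum distance sqrt(2).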

Upvotes: 1
