Reputation: 53
I want to get embeddings for a pair of words or phrases and calculate their similarity.
I observed that the similarity is the same whether I use get_sentence_vector() or get_word_vector() for a single word. For example, I can compute the embedding of 'puppy' with either method (embedding_2 or embedding_3) and get the same similarity, but embedding_2 and embedding_3 are not equal, which is weird:
from scipy.spatial.distance import cosine
import numpy as np
import fasttext
import fasttext.util

# Download the English model (skipped if the file already exists)
fasttext.util.download_model('en', if_exists='ignore')
model = fasttext.load_model('cc.en.300.bin')

# Embed the phrase 'baby dog' and the word 'puppy' (with both methods)
embedding_1 = model.get_sentence_vector('baby dog')
embedding_2 = model.get_word_vector('puppy')
embedding_3 = model.get_sentence_vector('puppy')

def cosine_similarity(embedding_1, embedding_2):
    # Calculate the cosine similarity of the two embeddings.
    sim = 1 - cosine(embedding_1, embedding_2)
    print('Cosine similarity: {:.2}'.format(sim))

# Compare the phrase with the word vector of 'puppy'
cosine_similarity(embedding_1, embedding_2)
# Compare the phrase with the sentence vector of 'puppy'
cosine_similarity(embedding_1, embedding_3)

# Check whether the two approaches yield the same vector.
is_equal = np.array_equal(embedding_2, embedding_3)
print(is_equal)
Whichever method I use, the similarity is always 0.76, but is_equal is False. I have two questions:
(1) I probably have to use get_sentence_vector() for phrases, but for single words, which one should I use? What happens when I call get_sentence_vector() on a single word?
(2) I use fastText because it can handle out-of-vocabulary words. Is it a good idea to use fastText embeddings for cosine similarity comparison?
Upvotes: 0
Views: 6548
Reputation: 3536
You should use get_word_vector for words and get_sentence_vector for sentences.
get_sentence_vector divides each word vector by its norm and then averages them. If you are interested in more details, read this.
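To make the normalize-then-average step concrete, here is a minimal sketch in plain Python. It does not call fastText: the 3-dimensional vectors are made up, and toy_vectors, word_vector and sentence_vector are hypothetical stand-ins that only mimic the behaviour described above, as I understand it for unsupervised models such as cc.en.300.bin.

```python
import math

# Made-up 3-dimensional "word vectors" standing in for a real model.
toy_vectors = {
    "puppy": [3.0, 4.0, 0.0],
    "baby": [1.0, 2.0, 2.0],
    "dog": [2.0, 0.0, 0.0],
}

def norm(v):
    """Euclidean (L2) norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def word_vector(word):
    """Hypothetical stand-in for model.get_word_vector: raw, unnormalized vector."""
    return toy_vectors[word]

def sentence_vector(text):
    """Hypothetical stand-in for model.get_sentence_vector:
    divide each word vector by its norm, then average the results."""
    total = [0.0, 0.0, 0.0]
    count = 0
    for word in text.split():
        v = word_vector(word)
        n = norm(v)
        if n > 0:  # skip zero vectors to avoid dividing by zero
            total = [t + x / n for t, x in zip(total, v)]
            count += 1
    return [t / count for t in total] if count else total

def cosine_similarity(a, b):
    """Cosine similarity; unaffected by the length of either vector."""
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

w = word_vector("puppy")           # raw vector, norm 5
s = sentence_vector("puppy")       # same direction, norm 1
phrase = sentence_vector("baby dog")

print(w == s)                                   # the two vectors differ
print(norm(w), norm(s))                         # only the length changed
print(cosine_similarity(phrase, w))
print(cosine_similarity(phrase, s))             # same value up to rounding
```

For a single word, the sentence vector is just the word vector scaled to unit length. Cosine similarity ignores length, so the similarity stays the same while an element-wise check such as np.array_equal reports False.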
Since fastText provides vector representations, it is a good idea to use these vectors to compare words and sentences.
Upvotes: 4