Trent

Reputation: 53

Understanding get_sentence_vector() and get_word_vector() for fasttext

I want to get embeddings for a pair of words or phrases and calculate their similarity.

I observed that the similarity is the same whether I use get_sentence_vector() or get_word_vector() for a single word. For example, I can use either method to compute the vector for 'puppy' (embedding_2 or embedding_3) and get the same similarity to embedding_1, yet embedding_2 and embedding_3 are not equal, which is weird:

    from scipy.spatial.distance import cosine
    import numpy as np
    import fasttext
    import fasttext.util
    
    # download an english model
    fasttext.util.download_model('en', if_exists='ignore')  # English
    model = fasttext.load_model('cc.en.300.bin')
    
    # Get a sentence vector for 'baby dog', and both a word vector and a sentence vector for 'puppy'.
    embedding_1 = model.get_sentence_vector('baby dog')
    embedding_2 = model.get_word_vector('puppy')
    embedding_3 = model.get_sentence_vector('puppy')
    
    def cosine_similarity(embedding_1, embedding_2):
        # Calculate the cosine similarity of the two embeddings.
        sim = 1 - cosine(embedding_1, embedding_2)
        print('Cosine similarity: {:.2}'.format(sim))
        
    # compare the phrase with the word vector of 'puppy'
    cosine_similarity(embedding_1, embedding_2)
    # compare the phrase with the sentence vector of 'puppy'
    cosine_similarity(embedding_1, embedding_3)
    
    
    # Checking if the two approaches yield the same result.
    is_equal = np.array_equal(embedding_2, embedding_3)
    
    # Printing the result.
    print(is_equal)

Whichever method I use, the similarity is always 0.76, but is_equal is False. I have two questions:

(1) I probably have to use get_sentence_vector() for phrases, but in terms of words, which one should I use? What happens when I call get_sentence_vector() for a word?

(2) I use fastText because it can handle out-of-vocabulary words. Is it a good idea to use fastText embeddings for cosine similarity comparison?

Upvotes: 0

Views: 6548

Answers (1)

  1. You should use get_word_vector for words and get_sentence_vector for sentences.

    get_sentence_vector divides each word vector by its norm and then averages them. If you are interested in more details, read this.

  2. Since fastText provides vector representations, it is a good idea to use these vectors to compare words and sentences.
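
To see why the similarity stays the same while the vectors differ, here is a minimal NumPy sketch of the "normalize, then average" behavior described in point 1. It assumes an unsupervised model such as cc.en.300.bin. `sentence_vector_sketch` and `cosine_sim` are hypothetical helpers written for illustration, not part of the fastText API, and the 3-dimensional vectors are toy stand-ins for real embeddings. For a single word, the sentence vector is just the word vector scaled to unit length, and cosine similarity ignores scale.

```python
import numpy as np

def sentence_vector_sketch(word_vectors):
    """Average of L2-normalized word vectors (illustrative helper, not fastText API)."""
    total = np.zeros_like(word_vectors[0], dtype=float)
    count = 0
    for vec in word_vectors:
        norm = np.linalg.norm(vec)
        if norm > 0:  # skip all-zero vectors to avoid dividing by zero
            total += vec / norm
            count += 1
    return total / count if count > 0 else total

def cosine_sim(a, b):
    """Cosine similarity; unchanged when either vector is rescaled by a positive factor."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for model.get_word_vector('puppy') and some other embedding.
puppy = np.array([3.0, 4.0, 0.0])
other = np.array([1.0, 2.0, 2.0])

as_word = puppy                               # analogous to get_word_vector('puppy')
as_sentence = sentence_vector_sketch([puppy])  # analogous to get_sentence_vector('puppy')

same_values = np.array_equal(as_word, as_sentence)  # False: one is a rescaled copy
sim_word = cosine_sim(other, as_word)
sim_sentence = cosine_sim(other, as_sentence)       # matches sim_word up to rounding

print(same_values, np.isclose(sim_word, sim_sentence))
```

So for single words either method gives the same cosine similarity. The choice only matters for multi-word input, or if you use a metric that is sensitive to vector length, such as Euclidean distance.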

Upvotes: 4
