What semantic similarity word-to-vec model is most consistent across languages?

Question

Problem statement: Looking for a Semantic Similarity Word-To-Vec model that is consistent across languages

Model used: from sentence_transformers import SentenceTransformer model = SentenceTransformer('all-MiniLM-L6-v2') "

Context: Semantic similarity is used in different applications, such as search.
Multilingual word-to-vec model shows different semantic similarity scores in English and Spanish for the same content.

In English, the dot product of the vectors for ‘Drink’ and ‘Coffee’ is high enough to signify a relationship between these two terms.

If these two terms are translated to Spanish, the dot product drops.

When used in a phrase, similarity score improves. The word-to-vec model leads to high similarity between generic terms for drink and a specific drink in both English and Spanish.

When tested by replacing the object, drink, with book, or cup with book, the similarity score in English is 0.42 and in Spanish 0.65. These two cases indicate that the similarity is not because of the object, only because of the phrase.

In the picture, I also attached the phrases and the dot product. enter image description here

I tried embedding phrases into vectors using this model:

from sentence_transformers import SentenceTransformer model = SentenceTransformer('all-MiniLM-L6-v2')

I expected the cosine similarity of phrases to be consistent across languages.

What semantic similarity word-to-vec model is most consistent across languages?

Answers (0)

Related Questions