Reputation: 5275
I am using a BERT model for context search in Italian, but it does not understand the contextual meaning of the sentence and returns wrong results.
In the example code below, I compare "milk with chocolate flavour" against two other kinds of milk and one chocolate bar, and it returns the highest similarity with the chocolate. It should return higher similarity with the other milks.
Can anyone suggest an improvement to the code below so that it returns semantically correct results?
Code:
!python -m spacy download it_core_news_lg
!pip install sentence-transformers

import scipy.spatial
import numpy as np
from sentence_transformers import models, SentenceTransformer

# works with Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish
model = SentenceTransformer('distiluse-base-multilingual-cased')

corpus = [
    "Alpro, Cioccolato bevanda a base di soia 1 ltr",       # Alpro, chocolate soy drink 1 l (soy milk)
    "Milka cioccolato al latte 100 g",                       # Milka milk chocolate 100 g
    "Danone, HiPRO 25g Proteine gusto cioccolato 330 ml",    # Danone, HiPRO 25 g protein, chocolate flavour, 330 ml (milk with chocolate flavour)
]
corpus_embeddings = model.encode(corpus)

queries = [
    'latte al cioccolato',  # milk with chocolate flavour
]
query_embeddings = model.encode(queries)

# Calculate cosine similarity of each query against each corpus sentence
closest_n = 10
for query, query_embedding in zip(queries, query_embeddings):
    distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]
    results = sorted(zip(range(len(distances)), distances), key=lambda x: x[1])

    print("\n======================\n")
    print("Query:", query)
    print("\nTop 10 most similar sentences in corpus:")
    for idx, distance in results[0:closest_n]:
        print(corpus[idx].strip(), "(Score: %.4f)" % (1 - distance))
Output:
======================
Query: latte al cioccolato
Top 10 most similar sentences in corpus:
Milka cioccolato al latte 100 g (Score: 0.7714)
Alpro, Cioccolato bevanda a base di soia 1 ltr (Score: 0.5586)
Danone, HiPRO 25g Proteine gusto cioccolato 330 ml (Score: 0.4569)
Upvotes: 1
Views: 568
Reputation: 13666
The problem is not with your code; it is simply insufficient model performance.
There are a few things you can do. First, you can try the Universal Sentence Encoder (USE). In my experience its embeddings are a little better, at least in English.
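For example, a minimal sketch with the multilingual USE model from TF Hub (the exact hub URL and the tensorflow_text dependency are assumptions on my side; check tfhub.dev for the current multilingual USE module):

import numpy as np
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401  (registers the SentencePiece ops USE needs)

# Assumed hub URL for the multilingual Universal Sentence Encoder
use_model = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")

corpus = [
    "Alpro, Cioccolato bevanda a base di soia 1 ltr",
    "Milka cioccolato al latte 100 g",
    "Danone, HiPRO 25g Proteine gusto cioccolato 330 ml",
]
query = "latte al cioccolato"

corpus_emb = use_model(corpus).numpy()
query_emb = use_model([query]).numpy()

# Cosine similarity, ranked the same way as in the question
scores = (query_emb @ corpus_emb.T) / (
    np.linalg.norm(query_emb, axis=1, keepdims=True) * np.linalg.norm(corpus_emb, axis=1)
)
for sentence, score in sorted(zip(corpus, scores[0]), key=lambda x: -x[1]):
    print(f"{sentence} (Score: {score:.4f})")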
Second, you can try a different model, for example sentence-transformers/xlm-r-distilroberta-base-paraphrase-v1. It is based on XLM-RoBERTa and might give better performance.
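The only change needed in your script is the model name, e.g.:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/xlm-r-distilroberta-base-paraphrase-v1')

corpus = [
    "Alpro, Cioccolato bevanda a base di soia 1 ltr",
    "Milka cioccolato al latte 100 g",
    "Danone, HiPRO 25g Proteine gusto cioccolato 330 ml",
]
corpus_embeddings = model.encode(corpus)
query_embeddings = model.encode(["latte al cioccolato"])
# ...the cosine-similarity ranking loop from the question stays unchanged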
Third, you can combine embeddings from several models (simply by concatenating the representations). In some cases it helps, at the expense of much heavier compute.
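A rough sketch of that combination, assuming two sentence-transformers models (the second model name is just the one mentioned above, and normalising before concatenating is my own choice so that neither model dominates the cosine score purely through vector scale):

import numpy as np
from sentence_transformers import SentenceTransformer

model_a = SentenceTransformer('distiluse-base-multilingual-cased')
model_b = SentenceTransformer('sentence-transformers/xlm-r-distilroberta-base-paraphrase-v1')

def encode_combined(texts):
    # L2-normalise each model's output, then concatenate the vectors
    emb_a = model_a.encode(texts, normalize_embeddings=True)
    emb_b = model_b.encode(texts, normalize_embeddings=True)
    return np.concatenate([emb_a, emb_b], axis=1)

corpus_embeddings = encode_combined([
    "Alpro, Cioccolato bevanda a base di soia 1 ltr",
    "Milka cioccolato al latte 100 g",
    "Danone, HiPRO 25g Proteine gusto cioccolato 330 ml",
])
query_embeddings = encode_combined(["latte al cioccolato"])
# ...then rank with the same scipy cdist cosine loop as in the question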
And finally, you can train your own model. It is well known that monolingual models perform significantly better than multilingual ones. You can follow the sentence-transformers training guide and train your own Italian model.
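A very rough sketch of such fine-tuning with the sentence-transformers fit() API (the training pairs, labels, base model and hyperparameters below are placeholders, not recommendations; in practice you need a sizeable labelled Italian dataset):

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('distiluse-base-multilingual-cased')

# Placeholder sentence pairs with similarity labels; replace with real Italian data
train_examples = [
    InputExample(texts=['latte al cioccolato', 'bevanda al latte gusto cacao'], label=0.9),
    InputExample(texts=['latte al cioccolato', 'tavoletta di cioccolato'], label=0.2),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
model.save('my-italian-sbert')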
Upvotes: 3