Reputation: 5275
I am using a BERT model for context search in Italian, but it does not understand the contextual meaning of the sentence and returns wrong results.
In the example code below, I compare "milk with chocolate flavour" against two other kinds of milk and one chocolate bar, and it returns the highest similarity with the chocolate. It should return higher similarity with the other milks.
Can anyone suggest an improvement to the code below so that it returns semantically correct results?
Code:
!python -m spacy download it_core_news_lg
!pip install sentence-transformers

import scipy.spatial
import numpy as np
from sentence_transformers import models, SentenceTransformer

# works with Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish
model = SentenceTransformer('distiluse-base-multilingual-cased')

corpus = [
    "Alpro, Cioccolato bevanda a base di soia 1 ltr",       # Alpro, chocolate soy drink 1 l (soy milk)
    "Milka cioccolato al latte 100 g",                       # Milka milk chocolate 100 g
    "Danone, HiPRO 25g Proteine gusto cioccolato 330 ml",    # Danone, HiPRO 25 g protein, chocolate flavour, 330 ml (milk with chocolate flavour)
]
corpus_embeddings = model.encode(corpus)

queries = [
    'latte al cioccolato',  # milk with chocolate flavour
]
query_embeddings = model.encode(queries)

# Calculate cosine similarity of each query against each corpus sentence
closest_n = 10
for query, query_embedding in zip(queries, query_embeddings):
    distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]
    results = sorted(zip(range(len(distances)), distances), key=lambda x: x[1])

    print("\n======================\n")
    print("Query:", query)
    print("\nTop 10 most similar sentences in corpus:")
    for idx, distance in results[0:closest_n]:
        print(corpus[idx].strip(), "(Score: %.4f)" % (1 - distance))
Output:
======================
Query: latte al cioccolato
Top 10 most similar sentences in corpus:
Milka cioccolato al latte 100 g (Score: 0.7714)
Alpro, Cioccolato bevanda a base di soia 1 ltr (Score: 0.5586)
Danone, HiPRO 25g Proteine gusto cioccolato 330 ml (Score: 0.4569)
Upvotes: 1
Views: 568
Reputation: 13666
The problem is not with your code; it is simply insufficient model performance.
There are a few things you can do. First, you can try the Universal Sentence Encoder (USE). In my experience its embeddings are a little better, at least in English.
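For example, a minimal sketch with the multilingual USE model from TF Hub (the exact hub URL and the tensorflow_text dependency are assumptions on my side; check tfhub.dev for the current multilingual USE module):

import numpy as np
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401  (registers the SentencePiece ops USE needs)

# Assumed hub URL for the multilingual Universal Sentence Encoder
use_model = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")

corpus = [
    "Alpro, Cioccolato bevanda a base di soia 1 ltr",
    "Milka cioccolato al latte 100 g",
    "Danone, HiPRO 25g Proteine gusto cioccolato 330 ml",
]
query = "latte al cioccolato"

corpus_emb = use_model(corpus).numpy()
query_emb = use_model([query]).numpy()

# Cosine similarity, ranked the same way as in the question
scores = (query_emb @ corpus_emb.T) / (
    np.linalg.norm(query_emb, axis=1, keepdims=True) * np.linalg.norm(corpus_emb, axis=1)
)
for sentence, score in sorted(zip(corpus, scores[0]), key=lambda x: -x[1]):
    print(f"{sentence} (Score: {score:.4f})")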
Second, you can try a different model, for example sentence-transformers/xlm-r-distilroberta-base-paraphrase-v1. It is based on XLM-RoBERTa and might give better performance.
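The only change needed in your script is the model name, e.g.:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/xlm-r-distilroberta-base-paraphrase-v1')

corpus = [
    "Alpro, Cioccolato bevanda a base di soia 1 ltr",
    "Milka cioccolato al latte 100 g",
    "Danone, HiPRO 25g Proteine gusto cioccolato 330 ml",
]
corpus_embeddings = model.encode(corpus)
query_embeddings = model.encode(["latte al cioccolato"])
# ...the cosine-similarity ranking loop from the question stays unchanged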
Third, you can combine embeddings from several models (simply by concatenating the representations). In some cases it helps, at the expense of much heavier compute.
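A rough sketch of that combination, assuming two sentence-transformers models (the second model name is just the one mentioned above, and normalising before concatenating is my own choice so that neither model dominates the cosine score purely through vector scale):

import numpy as np
from sentence_transformers import SentenceTransformer

model_a = SentenceTransformer('distiluse-base-multilingual-cased')
model_b = SentenceTransformer('sentence-transformers/xlm-r-distilroberta-base-paraphrase-v1')

def encode_combined(texts):
    # L2-normalise each model's output, then concatenate the vectors
    emb_a = model_a.encode(texts, normalize_embeddings=True)
    emb_b = model_b.encode(texts, normalize_embeddings=True)
    return np.concatenate([emb_a, emb_b], axis=1)

corpus_embeddings = encode_combined([
    "Alpro, Cioccolato bevanda a base di soia 1 ltr",
    "Milka cioccolato al latte 100 g",
    "Danone, HiPRO 25g Proteine gusto cioccolato 330 ml",
])
query_embeddings = encode_combined(["latte al cioccolato"])
# ...then rank with the same scipy cdist cosine loop as in the question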
And finally, you can train your own model. It is well known that monolingual models perform significantly better than multilingual ones. You can follow the sentence-transformers training guide and train your own Italian model.
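A very rough sketch of such fine-tuning with the sentence-transformers fit() API (the training pairs, labels, base model and hyperparameters below are placeholders, not recommendations; in practice you need a sizeable labelled Italian dataset):

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('distiluse-base-multilingual-cased')

# Placeholder sentence pairs with similarity labels; replace with real Italian data
train_examples = [
    InputExample(texts=['latte al cioccolato', 'bevanda al latte gusto cacao'], label=0.9),
    InputExample(texts=['latte al cioccolato', 'tavoletta di cioccolato'], label=0.2),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
model.save('my-italian-sbert')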
Upvotes: 3