DevPy

Reputation: 497

Using BERT to generate similar words or synonyms through word embeddings

As we all know, BERT produces strong word embeddings, probably better than word2vec and other models.

I want to build a model on top of BERT word embeddings to generate synonyms or similar words, the same way we do in Gensim Word2Vec. In other words, I want to replicate Gensim's model.most_similar() on BERT word embeddings.

I researched this a lot, and it seems possible, but the problem is that I only get the embeddings as numbers; there is no way to get the actual word back from them. Can anybody help me with this?

Upvotes: 6

Views: 5986

Answers (1)

Birol Kuyumcu

Reputation: 1213

  1. BERT uses tokens, which are not exactly the same as words, so a single word may be split into several tokens (see the tokenization sketch after this list).

  2. BERT generates an embedding vector for each token with respect to the other tokens in its context.

  3. You can take a pretrained BERT model, feed it a single word, get the output token embeddings, and average them, so you get a single vector for the word.

  4. Get a list of words and calculate a vector for each of them, as in the snippet below.
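To make point 1 concrete, here is a minimal sketch of how BERT's WordPiece tokenizer can split one word into several tokens (the word "embeddings" is just an illustrative example):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# A single word can become multiple subword tokens
print(tokenizer.tokenize("embeddings"))  # e.g. ['em', '##bed', '##ding', '##s']

Steps 3 and 4 then look like this: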

from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

word = "Hello"
inputs = tokenizer(word, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Average the token embeddings to get a single vector for the word
word_vect = outputs.last_hidden_state.mean(dim=1).squeeze().numpy()
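Note that the snippet averages the last hidden state rather than using outputs.pooler_output; the pooler output is the [CLS] token passed through an extra dense layer trained for next-sentence prediction, and it is generally a poorer representation of a single word.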
  5. Calculate the distances between these vectors, so you can find the most similar words by distance (a sketch follows below).
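Putting it all together, here is a minimal most_similar()-style sketch under the assumptions above. The candidate vocabulary and the helper names (word_vector, most_similar) are illustrative, not part of any library:

from transformers import BertTokenizer, BertModel
import torch
import numpy as np

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def word_vector(word):
    # Encode one word and average its token embeddings
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()

# Illustrative candidate vocabulary; in practice use your own word list
vocab = ["happy", "joyful", "sad", "angry", "glad"]
vectors = np.stack([word_vector(w) for w in vocab])

def most_similar(word, topn=3):
    # Cosine similarity between the query vector and every candidate
    q = word_vector(word)
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)
    return [(vocab[i], float(sims[i])) for i in order[:topn]]

print(most_similar("cheerful"))

For a real vocabulary, precompute and cache the vectors once. Also keep in mind that these context-free averages discard BERT's main strength (contextual embeddings), so results may differ from what a dedicated static-embedding model like Word2Vec gives.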

Upvotes: 3
