Reputation: 497
As we all know, the BERT
model produces very capable word embeddings, arguably better than those of word2vec
and other models.
I want to build a model on top of BERT
word embeddings to generate synonyms or similar words, the same way we do in Gensim
Word2Vec
. In other words, I want to reproduce Gensim's model.most_similar()
method on BERT word embeddings.
I have researched this a lot, and it seems possible to do, but the problem is that I only get the embeddings as arrays of numbers; there is no obvious way to get the actual word back from them. Can anybody help me with this?
Upvotes: 6
Views: 5986
Reputation: 1213
BERT works on tokens, which are not exactly the same as words, so a single word may be split into several tokens.
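For example, a common word may stay a single token while a rarer word is split into subword pieces. A quick sketch with the Hugging Face tokenizer (the exact split depends on the model's vocabulary):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# A frequent word maps to one token, a rarer one to several subword pieces
print(tokenizer.tokenize("hello"))       # ['hello']
print(tokenizer.tokenize("embeddings"))  # ['em', '##bed', '##ding', '##s']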
BERT generates an embedding vector for each token, conditioned on the other tokens in the context.
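You can see this by embedding the same word in two different sentences and comparing the vectors. A minimal sketch (the sentences are just for illustration, and it assumes the word maps to a single token):

from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def token_vector(sentence, word):
    # Return the last-layer embedding of `word` inside `sentence`
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return outputs.last_hidden_state[0, tokens.index(word)]

v1 = token_vector("i deposited cash at the bank", "bank")
v2 = token_vector("we sat on the bank of the river", "bank")

# Same word, different contexts -> different vectors, so similarity is below 1.0
print(torch.cosine_similarity(v1, v2, dim=0).item())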
You can take a pretrained BERT model, feed it a single word, and average the output token embeddings, so you get a single vector for that word.
Then get a list of words and calculate a vector for each of them:
from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

word = "Hello"
inputs = tokenizer(word, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Average the last-layer token embeddings to get one vector for the word
word_vect = outputs.last_hidden_state.mean(dim=1).squeeze(0).numpy()
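To get something like Gensim's model.most_similar(), one option is to build your own candidate word list, compute a vector for each word the same way, and rank the candidates by cosine similarity to the query word. A rough sketch (the candidate list here is made up, and mean pooling is just one reasonable choice):

from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def word_vector(word):
    # Mean-pool the last-layer token embeddings into one vector per word
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

def most_similar(query, candidates, topn=5):
    # Rank candidate words by cosine similarity to the query word
    qv = word_vector(query)
    scored = [(w, torch.cosine_similarity(qv, word_vector(w), dim=0).item())
              for w in candidates if w != query]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:topn]

vocab = ["car", "automobile", "truck", "banana", "house"]  # your own word list
print(most_similar("car", vocab))

Keep in mind that BERT vectors taken out of context are not guaranteed to behave like word2vec vectors, so the results will depend heavily on your candidate list and pooling strategy.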
Upvotes: 3