DevPy

Reputation: 497

Compare cosine similarity of word with BERT model

Hi, I am looking to generate similar words for a given word using a BERT model, the same way we use most_similar in gensim. The approach I found is:

from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

word = "Hello"
inputs = tokenizer(word, return_tensors="pt")
outputs = model(**inputs)

# pooler_output is the [CLS] token's representation passed through a
# dense + tanh layer; shape (1, 768) for bert-base-uncased
word_vect = outputs.pooler_output.detach().numpy()

Okay, this gives me the embedding for the word the user entered. Can we now compare this embedding against the embeddings of the entire BERT vocabulary using cosine similarity, take the top N closest matches, and then map those embeddings back to words using the vocab.txt file in the model? Is that possible?
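For context, here is a rough sketch of what I have in mind (untested, names are just illustrative). Instead of the pooled output, it uses the model's static input embedding matrix, which gives one comparable 768-dim vector per vocabulary entry:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Static input embedding table: one row per vocabulary entry,
# shape (vocab_size, hidden_size) = (30522, 768) for bert-base-uncased
vocab_embeddings = model.embeddings.word_embeddings.weight.detach()

# Embedding for the query word (its first sub-token, for simplicity)
word = "hello"
token_id = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(word))[0]
query = vocab_embeddings[token_id]

# Cosine similarity against every row of the vocabulary matrix
sims = torch.nn.functional.cosine_similarity(query.unsqueeze(0), vocab_embeddings)
top = torch.topk(sims, k=6)  # the top match is the word itself

# Map indices back to tokens (this is what vocab.txt holds)
print(tokenizer.convert_ids_to_tokens(top.indices.tolist()))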

Upvotes: 2

Views: 4200

Answers (1)

pavelgein

Reputation: 82

Seems like you need to store embeddings for all words in your vocabulary. After that, you can use standard tools to find the embeddings closest to the target embedding. For example, you can use NearestNeighbors from scikit-learn for exact search. Another option you might consider is HNSW, a data structure specifically designed for fast approximate nearest-neighbour search; Faiss, Facebook's similarity-search library, ships a good HNSW implementation.
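A minimal sketch of both routes, assuming you have already collected the per-token vectors into a float32 array of shape (vocab_size, hidden_size); the random data below is just a stand-in for the real BERT embeddings:

import numpy as np
from sklearn.neighbors import NearestNeighbors
import faiss

rng = np.random.default_rng(0)
vocab_vecs = rng.standard_normal((30522, 768)).astype("float32")  # stand-in for real embeddings
query = vocab_vecs[7592:7593]  # stand-in query vector, shape (1, 768)

# --- Exact search with scikit-learn (fine at BERT vocabulary scale) ---
nn = NearestNeighbors(n_neighbors=10, metric="cosine").fit(vocab_vecs)
dist, idx = nn.kneighbors(query)  # idx[0] holds vocab ids to look up in vocab.txt

# --- Approximate search with a Faiss HNSW index ---
# L2-normalize the vectors so that inner product equals cosine similarity
normed = vocab_vecs / np.linalg.norm(vocab_vecs, axis=1, keepdims=True)
index = faiss.IndexHNSWFlat(768, 32, faiss.METRIC_INNER_PRODUCT)  # 32 = HNSW graph degree M
index.add(normed)
sims, ids = index.search(normed[7592:7593], 10)

Exact brute-force search over a ~30k-entry vocabulary is cheap, so scikit-learn alone is enough here; HNSW pays off when the set of candidate vectors grows to millions.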

Upvotes: 1
