R00

Reputation: 73

How to know whether the token ids in a gensim pre-trained word2vec match the ids of a tokenizer's vocabulary

I am building a PyTorch BiLSTM that utilizes a pre-trained gensim word2vec. I first used an nn.Embedding layer that was trained with the model from scratch, but I decided to switch to pre-trained word2vec embeddings to improve accuracy. My model follows a simple BiLSTM architecture, where the first layer is the embedding layer, followed by the BiLSTM layer(s), and finally two feed-forward layers.

import torch
import gensim

import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

word2vec = gensim.models.Word2Vec.load('path_to_word2vec/wikipedia_cbow_100')
weights = torch.FloatTensor(word2vec.wv.vectors)

class BiLSTM_model(torch.nn.Module):
    def __init__(self, max_features, embedding_dim, hidden_dim, num_layers, lstm_dropout):
        # max_features is the vocabulary size (number of tokens/words).
        super().__init__()
        # Embedding layer previously trained from scratch:
        # self.embeddings = nn.Embedding(max_features, embedding_dim, padding_idx=0)
        # Now initialized from the pre-trained word2vec weights (frozen by default).
        self.embeddings = nn.Embedding.from_pretrained(weights)
        self.lstm = nn.LSTM(word2vec.wv.vector_size,
                            hidden_dim,
                            batch_first=True,
                            bidirectional=True,
                            num_layers=num_layers,
                            dropout=lstm_dropout)
        self.relu = nn.ReLU()
        self.fc1 = nn.Linear(hidden_dim * 2, 64)
        self.dropout = nn.Dropout(0.2)
        self.fc2 = nn.Linear(64, config['num_classes'])

    def forward(self, input):
        embeddings_out = self.embeddings(input)
        lstm_out, (hidden, cell) = self.lstm(embeddings_out)
        # Concatenate the final hidden states of the forward and backward directions.
        hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
        rel = self.relu(hidden)
        dense1 = self.fc1(rel)
        drop = self.dropout(dense1)
        final_out = self.fc2(drop)

        return final_out

I use a Keras tokenizer to tokenize the text and obtain the token ids.

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

## Tokenize the sentences
tokenizer = Tokenizer(num_words=config['max_features'])
tokenizer.fit_on_texts(list(train_X))
train_X = tokenizer.texts_to_sequences(train_X)
test_X = tokenizer.texts_to_sequences(test_X)
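
The sequences are then padded to a fixed length with pad_sequences (hence the import above); a minimal sketch of that step, where the maxlen value is only a placeholder:

maxlen = 100  # placeholder; choose a length that fits your data
train_X = pad_sequences(train_X, maxlen=maxlen, padding='post')
test_X = pad_sequences(test_X, maxlen=maxlen, padding='post')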

Finally, I use a standard training loop with an optimizer and a loss function. The code runs fine, but there are no performance gains from using the pre-trained embeddings.
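
For reference, a minimal sketch of such a loop (the model hyper-parameters, num_epochs, and train_loader are placeholders here):

model = BiLSTM_model(max_features=config['max_features'],
                     embedding_dim=word2vec.wv.vector_size,
                     hidden_dim=128, num_layers=2, lstm_dropout=0.2)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(num_epochs):
    model.train()
    for batch_x, batch_y in train_loader:  # batches of token ids (LongTensor) and class labels
        optimizer.zero_grad()
        logits = model(batch_x)
        loss = criterion(logits, batch_y)
        loss.backward()
        optimizer.step()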

I suspect that it has to do with the token ids not matching between the keras.preprocessing.text tokenizer and the gensim pre-trained embeddings. My question is: how do I confirm (or rule out) this inconsistency, and if it is the case, how do I handle the issue?

Note: I am using custom word2vec embeddings for the Arabic language. You can find the embeddings here.

Upvotes: 1

Views: 649

Answers (1)

R00

Reputation: 73

After looking into jhso's comment, it seems that the solution to this problem is to use word2vec.wv.index2word, which returns the vocabulary (words) as a list sorted in the order that reflects each word's position in the embedding matrix. For example, the following code:

pretrained_embedding = gensim.models.Word2Vec.load('path/to/embedding')
word_vectors = pretrained_embedding.wv
for i in range(0, 4):
    print(f"{i}: '{word_vectors.index2word[i]}'")

will print:

0: 'this'
1: 'is'
2: 'an'
3: 'example'

where the token 'this' has the id 0, and so on.

You then use word2vec.wv.index2word as the input to the keras.preprocessing.text.Tokenizer object's .fit_on_texts() method, as follows:

vocabulary = pretrained_embedding.wv.index2word
tokenizer = Tokenizer(num_words=config['max_features'])
tokenizer.fit_on_texts(vocabulary)

This should preserve the token ids between the gensim word2vec model and the Keras tokenizer.
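
To sanity-check the alignment afterwards, a quick comparison along these lines can be used (just a sketch; note that the Keras tokenizer's word_index is 1-based, since index 0 is reserved for padding, so an off-by-one shift is expected and has to be mirrored in the embedding matrix, e.g. with an extra padding row):

# Compare the gensim ordering with the Keras mapping for the first few words.
for gensim_id, word in enumerate(word_vectors.index2word[:10]):
    keras_id = tokenizer.word_index.get(word)
    print(f"word={word!r} gensim_id={gensim_id} keras_id={keras_id}")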

Upvotes: 0
