R00

Reputation: 73

How to know whether the token ids in a gensim pre-trained word2vec match the ids of a tokenizer's vocabulary

I am building a PyTorch BiLSTM that utilizes a pre-trained gensim word2vec. I first used an nn.Embedding layer that was trained with the model from scratch, but I decided to switch to pre-trained word2vec embeddings to improve accuracy. My model follows a simple BiLSTM architecture, where the first layer is the embedding layer, followed by the BiLSTM layer(s), and finally two feed-forward layers.

import torch
import gensim

import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

word2vec = gensim.models.Word2Vec.load('path_to_word2vec/wikipedia_cbow_100')
weights = torch.FloatTensor(word2vec.wv.vectors)

class BiLSTM_model(torch.nn.Module):
    def __init__(self, max_features, embedding_dim, hidden_dim, num_layers, lstm_dropout):
        # max_features is the vocabulary size (number of tokens/words).
        super().__init__()
        # Embedding layer previously trained from scratch:
        # self.embeddings = nn.Embedding(max_features, embedding_dim, padding_idx=0)
        # Now initialized from the pre-trained word2vec weights (frozen by default).
        self.embeddings = nn.Embedding.from_pretrained(weights)
        self.lstm = nn.LSTM(word2vec.wv.vector_size,
                            hidden_dim,
                            batch_first=True,
                            bidirectional=True,
                            num_layers=num_layers,
                            dropout=lstm_dropout)
        self.relu = nn.ReLU()
        self.fc1 = nn.Linear(hidden_dim * 2, 64)
        self.dropout = nn.Dropout(0.2)
        self.fc2 = nn.Linear(64, config['num_classes'])

    def forward(self, input):
        embeddings_out = self.embeddings(input)
        lstm_out, (hidden, cell) = self.lstm(embeddings_out)
        # Concatenate the final hidden states of the forward and backward directions.
        hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
        rel = self.relu(hidden)
        dense1 = self.fc1(rel)
        drop = self.dropout(dense1)
        final_out = self.fc2(drop)

        return final_out

I use a Keras tokenizer to tokenize the text and obtain the token ids.

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

## Tokenize the sentences
tokenizer = Tokenizer(num_words=config['max_features'])
tokenizer.fit_on_texts(list(train_X))
train_X = tokenizer.texts_to_sequences(train_X)
test_X = tokenizer.texts_to_sequences(test_X)
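
The sequences are then padded to a fixed length with pad_sequences (hence the import above); a minimal sketch of that step, where the maxlen value is only a placeholder:

maxlen = 100  # placeholder; choose a length that fits your data
train_X = pad_sequences(train_X, maxlen=maxlen, padding='post')
test_X = pad_sequences(test_X, maxlen=maxlen, padding='post')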

Finally, I use a standard training loop with an optimizer and a loss function. The code runs fine, but there are no performance gains from using the pre-trained embeddings.
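
For reference, a minimal sketch of such a loop (the model hyper-parameters, num_epochs, and train_loader are placeholders here):

model = BiLSTM_model(max_features=config['max_features'],
                     embedding_dim=word2vec.wv.vector_size,
                     hidden_dim=128, num_layers=2, lstm_dropout=0.2)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(num_epochs):
    model.train()
    for batch_x, batch_y in train_loader:  # batches of token ids (LongTensor) and class labels
        optimizer.zero_grad()
        logits = model(batch_x)
        loss = criterion(logits, batch_y)
        loss.backward()
        optimizer.step()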

I suspect that it has to do with the token ids not matching between the keras.preprocessing.text tokenizer and the gensim pre-trained embeddings. My question is: how do I confirm (or rule out) this inconsistency, and if it is the case, how do I handle the issue?

Note: I am using custom word2vec embeddings for the Arabic language. You can find the embeddings here.

Upvotes: 1

Views: 649

Answers (1)

R00

Reputation: 73

After looking into jhso's comment, it seems that the solution to this problem is to use word2vec.wv.index2word, which returns the vocabulary (words) as a list sorted in the order that reflects each word's position in the embedding matrix. For example, the following code:

pretrained_embedding = gensim.models.Word2Vec.load('path/to/embedding')
word_vectors = pretrained_embedding.wv
for i in range(0, 4):
    print(f"{i}: '{word_vectors.index2word[i]}'")

will print:

0: 'this'
1: 'is'
2: 'an'
3: 'example'

where the token 'this' has the id 0, and so on.

You then use word2vec.wv.index2word as the input to the keras.preprocessing.text.Tokenizer object's .fit_on_texts() method, as follows:

vocabulary = pretrained_embedding.wv.index2word
tokenizer = Tokenizer(num_words=config['max_features'])
tokenizer.fit_on_texts(vocabulary)

This should preserve the token ids between the gensim word2vec model and the Keras tokenizer.
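
To sanity-check the alignment afterwards, a quick comparison along these lines can be used (just a sketch; note that the Keras tokenizer's word_index is 1-based, since index 0 is reserved for padding, so an off-by-one shift is expected and has to be mirrored in the embedding matrix, e.g. with an extra padding row):

# Compare the gensim ordering with the Keras mapping for the first few words.
for gensim_id, word in enumerate(word_vectors.index2word[:10]):
    keras_id = tokenizer.word_index.get(word)
    print(f"word={word!r} gensim_id={gensim_id} keras_id={keras_id}")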

Upvotes: 0
