MEGHA MISHRA

Reputation: 265

Getting embedding matrix of all zeros after performing word embedding on any input data

I am trying to do word embeddings in Keras, using 'glove.6B.50d.txt' for the purpose. I get correct output up to the point where the embedding index is built from the "glove.6B.50d.txt" file.

But the embedding matrix is always full of zeros after I map the words from my input to the entries in the embedding index.

Here is the code:

#imports needed to run the snippet
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

#here is the example sentence given as input

line = "The quick brown fox jumped over the lazy dog"
line = line.split(" ")

#this is my embedding file
EMBEDDING_FILE='glove.6B.50d.txt'

embed_size = 10 # how big is each word vector
max_features = 10000 # how many unique words to use (i.e. num rows in the embedding matrix)
maxlen = 10 # max number of words in a comment to use


tokenizer = Tokenizer(num_words=max_features, split=" ", char_level=False)
tokenizer.fit_on_texts(list(line))
list_tokenized_train = tokenizer.texts_to_sequences(line)
sequences = tokenizer.texts_to_sequences(line)  # same mapping as list_tokenized_train

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)

print(sequences)
print(word_index)
print('Shape of data tensor:', X_t.shape)

#got correct output here as 

# Found 8 unique tokens.
# [[1], [2], [3], [4], [5], [6], [1], [7], [8]]
# {'the': 1, 'quick': 2, 'brown': 3, 'fox': 4, 'jumped': 5, 'over': 6, 'lazy': 7, 'dog': 8}
# Shape of data tensor: (9, 10)


#loading the embedding file to build the embedding index
embeddings_index = {}
for i in open(EMBEDDING_FILE, "rb"):
    values = i.split()
    word = values[0]
    #print(word)
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

#Found 400000 word vectors.

#making the embedding matrix

embedding_matrix = np.zeros((len(word_index) + 1, embed_size))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

When I print the embedding matrix here, I get all zeros in it (i.e. not a single word from the input was recognized).

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

Also, if I print embeddings_index.get(word) for each iteration, it is unable to fetch the word and returns None.

Where am I going wrong in the code?

Upvotes: 1

Views: 1664

Answers (2)

MEGHA MISHRA

Reputation: 265

Got the problem solved today. It turned out that embeddings_index.get(word) was unable to find the words because of an encoding issue.

I changed for i in open(EMBEDDING_FILE, "rb"): in the preparation of the embedding matrix to for i in open(EMBEDDING_FILE, 'r', encoding='utf-8'): and this solved the problem.
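To spell that out: opening the file in binary mode makes the dictionary keys bytes objects (b'the' rather than 'the'), so lookups with the str keys from word_index always return None. A minimal sketch of the corrected loading loop (file name and variable names as in the question):

import numpy as np

embeddings_index = {}
with open(EMBEDDING_FILE, 'r', encoding='utf-8') as f:
    for row in f:
        values = row.split()
        word = values[0]  # a str such as 'the', not b'the'
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

# sanity check: this should now return a 50-dimensional vector, not None
print(embeddings_index.get('the'))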

Upvotes: 0

Anirudh Bhardwaj

Reputation: 21

  1. The embed size should be 50, not 10: it has to match the dimensionality of the pre-trained vectors in glove.6B.50d.txt.
  2. The number of features should be >> 50 (make it close to 10,000); restricting it to a small value means a whole lot of the vectors will be missing. See the sketch after this list.
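
As a rough sketch of how these fixes plug into a model (this assumes the embedding_matrix from the question is rebuilt with embed_size = 50, and uses the weights argument of the Keras 2 Embedding layer):

from keras.models import Sequential
from keras.layers import Embedding

embed_size = 50  # must match the 50-dimensional GloVe vectors
embedding_matrix = np.zeros((len(word_index) + 1, embed_size))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

model = Sequential()
model.add(Embedding(input_dim=len(word_index) + 1,
                    output_dim=embed_size,
                    weights=[embedding_matrix],   # initialize with GloVe
                    input_length=maxlen,
                    trainable=False))             # keep the pre-trained vectors frozen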

Upvotes: 1
