Reputation: 303
I would like to perform word embedding with pretrained glove embeddings which I have downloaded here.
I am testing with a 6-word sentence and a max document length of 30. I use a learn.preprocessing.VocabularyProcessor object to learn the token-id dictionary, and its transform() method to convert the input sentence to a list of word ids, so that I can look them up in the embedding matrix.
Why does VocabularyProcessor.transform() return a 6 x 30 array? I would expect it to simply return a list of ids, one for each word in the test sentence.
#show vocab and embedding
print('vocab size: %d\n' % vocab_size)
print('embedding dim: %d\n' % embedding_dim)

#test input
test_input_sentence = "the cat sat on the mat"
test_words_list = test_input_sentence.split()
print(test_words_list)

#create embedding matrix W, and define a placeholder to be fed
W = tf.Variable(tf.constant(0.0, shape=[vocab_size, embedding_dim]),
                trainable=False, name="W")
embedding_placeholder = tf.placeholder(tf.float32, [vocab_size, embedding_dim])
embedding_init = W.assign(embedding_placeholder)
print('initialised embedding')
print(embedding_init.get_shape())

with tf.Session() as sess:
    sess.run(embedding_init, feed_dict={embedding_placeholder: embedding})

    #init a vocab processor object
    vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)

    #fit = learn a vocabulary dictionary of all tokens in the raw documents
    pretrain = vocab_processor.fit(vocab)
    print('vocab preprocessor done')

    #transform input to word-id matrix
    x = np.array(list(vocab_processor.transform(test_words_list)))
    print('word id list shape:')
    print(x.shape)

    print('embedding tensor shape:')
    print(W.get_shape())

    vec = tf.nn.embedding_lookup(W, x)
    print('vectors shape:')
    print(vec.get_shape())
    print('embeddings:')
    print(sess.run(vec))
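As an aside on the shapes above: tf.nn.embedding_lookup is essentially row indexing into W, so its output shape is the id array's shape plus (embedding_dim,). A NumPy sketch of that indexing (toy sizes, not the pretrained GloVe matrix):

```python
import numpy as np

vocab_size, embedding_dim = 10, 4
W = np.random.rand(vocab_size, embedding_dim)  # stand-in embedding matrix

# embedding_lookup is row indexing: output shape = ids.shape + (embedding_dim,)
ids = np.array([[1, 2, 3]])  # shape (1, 3)
vec = W[ids]                 # shape (1, 3, 4)
print(vec.shape)
```

This is why a (6, 30) id array produces a (6, 30, embedding_dim) tensor of vectors.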
Upvotes: 0
Views: 761
Reputation: 628
From the docstring of the transform() function in https://github.com/petewarden/tensorflow_makefile/blob/master/tensorflow/contrib/learn/python/learn/preprocessing/text.py:
"""Transforms input documents into sequence of ids.
Args:
X: iterator or list of input documents.
Documents can be bytes or unicode strings, which will be encoded as
utf-8 to map to bytes. Note, in Python2 str and bytes is the same type.
Returns:
iterator of byte ids.
"""
Since you are passing a list of tokens while the function expects a list of documents, each word in your list is treated as a separate document, so the output has shape 6 x 30: six one-word "documents", each padded to max_document_length = 30. Pass the whole sentence as a single document instead, e.g. vocab_processor.transform([test_input_sentence]), which yields shape 1 x 30.
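A minimal sketch of the shape difference in plain Python/NumPy (transform_like is a hypothetical stand-in mimicking VocabularyProcessor's tokenize-and-pad behaviour, not the real API):

```python
import numpy as np

MAX_DOC_LEN = 30
vocab = {"<UNK>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}

def transform_like(documents, vocab, max_len=MAX_DOC_LEN):
    """Mimics VocabularyProcessor.transform: one padded id row per *document*."""
    rows = []
    for doc in documents:
        ids = [vocab.get(tok, 0) for tok in doc.split()][:max_len]
        rows.append(ids + [0] * (max_len - len(ids)))
    return np.array(rows)

sentence = "the cat sat on the mat"

# Passing the token list: each word is treated as its own document -> 6 rows
x_wrong = transform_like(sentence.split(), vocab)
print(x_wrong.shape)  # (6, 30)

# Passing the sentence as a one-element list: one document -> 1 row
x_right = transform_like([sentence], vocab)
print(x_right.shape)  # (1, 30)
```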
Upvotes: 1