Reputation: 303
I would like to perform word embedding with pretrained glove embeddings which I have downloaded here.
I am testing with a 6-word sentence and a max document length of 30. I use a learn.preprocessing.VocabularyProcessor object to learn the token-id dictionary, and its transform() method to convert the input sentence to a list of word ids, so that I can look them up in the embedding matrix.
Why does VocabularyProcessor.transform() return a 6 x 30 array? I would expect it to simply return a list of ids, one for each word in the test sentence.
#show vocab and embedding
print('vocab size: %d\n' % vocab_size)
print('embedding dim: %d\n' % embedding_dim)

#test input
test_input_sentence = "the cat sat on the mat"
test_words_list = test_input_sentence.split()
print(test_words_list)

#create embedding matrix W, and define a placeholder to be fed
W = tf.Variable(tf.constant(0.0, shape=[vocab_size, embedding_dim]),
                trainable=False, name="W")
embedding_placeholder = tf.placeholder(tf.float32, [vocab_size, embedding_dim])
embedding_init = W.assign(embedding_placeholder)
print('initialised embedding')
print(embedding_init.get_shape())

with tf.Session() as sess:
    sess.run(embedding_init, feed_dict={embedding_placeholder: embedding})

    #init a vocab processor object
    vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)

    #fit = learn a vocabulary dictionary of all tokens in the raw documents
    pretrain = vocab_processor.fit(vocab)
    print('vocab preprocessor done')

    #transform input to word-id matrix
    x = np.array(list(vocab_processor.transform(test_words_list)))
    print('word id list shape:')
    print(x.shape)

    print('embedding tensor shape:')
    print(W.get_shape())

    vec = tf.nn.embedding_lookup(W, x)
    print('vectors shape:')
    print(vec.get_shape())
    print('embeddings:')
    print(sess.run(vec))
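As an aside on the shapes above: tf.nn.embedding_lookup is essentially row indexing into W, so its output shape is the id array's shape plus (embedding_dim,). A NumPy sketch of that indexing (toy sizes, not the pretrained GloVe matrix):

```python
import numpy as np

vocab_size, embedding_dim = 10, 4
W = np.random.rand(vocab_size, embedding_dim)  # stand-in embedding matrix

# embedding_lookup is row indexing: output shape = ids.shape + (embedding_dim,)
ids = np.array([[1, 2, 3]])  # shape (1, 3)
vec = W[ids]                 # shape (1, 3, 4)
print(vec.shape)
```

This is why a (6, 30) id array produces a (6, 30, embedding_dim) tensor of vectors.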
Upvotes: 0
Views: 761
Reputation: 628
From the docstring of the transform() function in https://github.com/petewarden/tensorflow_makefile/blob/master/tensorflow/contrib/learn/python/learn/preprocessing/text.py:
"""Transforms input documents into sequence of ids.
Args:
X: iterator or list of input documents.
Documents can be bytes or unicode strings, which will be encoded as
utf-8 to map to bytes. Note, in Python2 str and bytes is the same type.
Returns:
iterator of byte ids.
"""
Since you are passing a list of tokens while the function expects a list of documents, each word in your list is treated as a separate document, so the output has shape 6 x 30: six one-word "documents", each padded to max_document_length = 30. Pass the whole sentence as a single document instead, e.g. vocab_processor.transform([test_input_sentence]), which yields shape 1 x 30.
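A minimal sketch of the shape difference in plain Python/NumPy (transform_like is a hypothetical stand-in mimicking VocabularyProcessor's tokenize-and-pad behaviour, not the real API):

```python
import numpy as np

MAX_DOC_LEN = 30
vocab = {"<UNK>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}

def transform_like(documents, vocab, max_len=MAX_DOC_LEN):
    """Mimics VocabularyProcessor.transform: one padded id row per *document*."""
    rows = []
    for doc in documents:
        ids = [vocab.get(tok, 0) for tok in doc.split()][:max_len]
        rows.append(ids + [0] * (max_len - len(ids)))
    return np.array(rows)

sentence = "the cat sat on the mat"

# Passing the token list: each word is treated as its own document -> 6 rows
x_wrong = transform_like(sentence.split(), vocab)
print(x_wrong.shape)  # (6, 30)

# Passing the sentence as a one-element list: one document -> 1 row
x_right = transform_like([sentence], vocab)
print(x_right.shape)  # (1, 30)
```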
Upvotes: 1