Reputation: 233
I have recently tried to use word2vec. I trained my model and got vectors assigned to all the words, but I do not know how to find the vector value for each word.
I tried to print the model, but it only outputs all the vectors it has trained. I still don't get it: I thought the vectors are tied to individual words, but somehow everything is inside one list.
My understanding of word2vec is that each word (say W1) has its own vectors, and each vector represents the similarity between the current word (W1) and another word (W2). Since each word is assigned sparse vectors, there should be lots of vectors for W1 alone. However, when I print my model, I receive what looks like the output for only one word, and I am not sure which word it is. Can anyone please assist me?
My code:
import collections
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
batch_size = 20
embedding_size = 2
num_sampled = 15
sentences = ["I have something that I want to say to him",
"How are you",
"We can see many stars tonight",
"That's our house",
"sung likes cats",
"she loves dogs",
"Do you know what he has done",
"cats are great companions when they want to be",
"We need to invest in clean, renewable energy",
"women love his man",
"queen love his king",
"girl love his boy",
"The line is too long. Why don't you come back tomorrow",
"man and women roam in park",
"Does it really matter",
"dynasty king remain mortal"]
words = " ".join(sentences).split()
count = collections.Counter(words).most_common()
# Build dictionaries
reverse_dictionary = [i[0] for i in count] #reverse dic, idx -> word
dic = {w: i for i, w in enumerate(reverse_dictionary)} #dic, word -> id
voc_size = len(dic)
data = [dic[word] for word in words]
cbow_pairs = []
for i in range(1, len(data)-1):
    # pair the context words [previous, next] with the centre word
    cbow_pairs.append([[data[i-1], data[i+1]], data[i]])
skip_gram_pairs = []
for c in cbow_pairs:
    # (centre word, one context word) pairs for skip-gram training
    skip_gram_pairs.append([c[1], c[0][0]])
    skip_gram_pairs.append([c[1], c[0][1]])
def generate_batch(size):
    assert size < len(skip_gram_pairs)
    x_data = []
    y_data = []
    # sample `size` distinct (centre, context) pairs at random
    r = np.random.choice(range(len(skip_gram_pairs)), size, replace=False)
    for i in r:
        x_data.append(skip_gram_pairs[i][0])   # n dim
        y_data.append([skip_gram_pairs[i][1]]) # n, 1 dim
    return x_data, y_data
# Input data
train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
# Ops and variables pinned to the CPU because of missing GPU implementation
with tf.device('/cpu:0'):
    # Look up embeddings for inputs.
    embeddings = tf.Variable(
        tf.random_uniform([voc_size, embedding_size], -1.0, 1.0))
    embed = tf.nn.embedding_lookup(embeddings, train_inputs)  # lookup table
# Construct the variables for the NCE loss
nce_weights = tf.Variable(
    tf.random_uniform([voc_size, embedding_size], -1.0, 1.0))
nce_biases = tf.Variable(tf.zeros([voc_size]))
# Compute the average NCE loss for the batch.
# This does the magic:
# tf.nn.nce_loss(weights, biases, inputs, labels, num_sampled, num_classes ...)
# It automatically draws negative samples when we evaluate the loss.
loss = tf.reduce_mean(tf.nn.nce_loss(nce_weights, nce_biases, train_labels, embed, num_sampled, voc_size))
# Use the adam optimizer
train_op = tf.train.AdamOptimizer(1e-1).minimize(loss)
# Launch the graph in a session
with tf.Session() as sess:
    # Initializing all variables
    tf.global_variables_initializer().run()
    for step in range(100):
        batch_inputs, batch_labels = generate_batch(batch_size)
        _, loss_val = sess.run([train_op, loss],
                               feed_dict={train_inputs: batch_inputs, train_labels: batch_labels})
    # Final embeddings are ready for you to use. Need to normalize for practical use
    trained_embeddings = embeddings.eval()
print(trained_embeddings)
Current output: this output somehow seems to be for only a single word and not for all the words in the corpus.
[[-0.751498 -1.4963825 ]
[-0.7022982 -1.4211462 ]
[-1.6240289 -0.96706766]
[-3.2109795 -1.2967492 ]
[-0.8835893 -1.5251521 ]
[-1.4316636 -1.4322135 ]
[-1.8665589 -1.1734825 ]
[-0.4726948 -1.836668 ]
[-0.11171409 -2.0847342 ]
[-1.0599283 -0.9792351 ]
[-1.6748023 -0.9584413 ]
[-0.8855507 -1.3226773 ]
[-0.9565117 -1.5730425 ]
[-1.2891663 -1.1687953 ]
[-0.06940217 -1.7782353 ]
[-0.92220575 -1.8264929 ]
[-3.2258956 -1.105678 ]
[-2.4262347 -0.9806146 ]
[-0.36716968 -2.3782976 ]
[-0.4972397 -1.9926786 ]
[-0.65995616 -1.2129989 ]
[-0.53334516 -1.5244756 ]
[-1.4961753 -0.5592766 ]
[-0.57391864 -1.9852302 ]
[-0.6580112 -1.0749325 ]
[-0.7821078 -1.598069 ]
[-1.264001 -1.002861 ]
[-0.23881587 -2.103974 ]
[-0.3729657 -1.9456012 ]
[-0.9266953 -1.516872 ]
[-1.4948957 -1.1232641 ]
[-1.109361 -1.3108519 ]
[-2.0748782 -0.93853486]
[-2.0241299 -0.8716516 ]
[-0.9448593 -1.0530868 ]
[-1.4578291 -0.57673496]
[-0.31915158 -1.4830168 ]
[-1.2568909 -1.0629684 ]
[-0.50458056 -2.2233846 ]
[-1.2059065 -1.0402468 ]
[-0.17204402 -1.8913956 ]
[-1.5484996 -1.0246676 ]
[-1.7026784 -1.4470854 ]
[-2.114282 -1.2304462 ]
[-1.6737207 -1.2598573 ]
[-0.9031189 -1.8086503 ]
[-1.4084693 -0.9171761 ]
[-1.261698 -1.5333931 ]
[-2.7891722 -0.69629264]
[-2.7634912 -1.0250676 ]
[-2.171037 -1.3402877 ]
[-1.5588827 -1.4741637 ]
[-2.012083 -1.6028976 ]
[-1.4286829 -1.485801 ]
[-0.06908941 -2.370034 ]
[-1.3277153 -1.2935033 ]
[-0.52055264 -1.2549478 ]
[-2.4971442 -0.6335571 ]
[-2.7244987 -0.6136059 ]
[-0.7155211 -1.8717885 ]
[-2.1862056 -0.78832203]
[-2.068198 -0.96536046]
[-0.9023069 -1.6741301 ]
[-0.39895654 -1.584905 ]
[-0.656657 -1.6787726 ]
[ 0.13354267 -2.105389 ]
[-1.248123 -1.7273897 ]
[-0.6168909 -1.3929827 ]
[-0.1866242 -2.0612721 ]
[-2.3246803 -1.1561321 ]
[ 0.88145804 0.35487294]]
Example of expected output:
[-0.751498 -1.4963825 ] together with the word those two values belong to, for example "how" or "are".
Upvotes: 1
Views: 734
Reputation: 54243
If you've trained a Word2Vec model to learn 2-dimensional vectors per word, then each word will have its own 2-dimensional vector.
I can't evaluate your full implementation; you should probably be using a known-good, off-the-shelf Word2Vec library. Also, Word2Vec really depends on large, diverse training data: toy-sized examples won't usually show its real behaviors and benefits.
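For example, with gensim (a rough sketch, not a drop-in fix for your script: it assumes gensim 4.x, reuses your sentences list, and vector_size=2 only mirrors your tiny embedding_size):

from gensim.models import Word2Vec

# tokenize the same toy sentences
tokenized = [s.lower().split() for s in sentences]

# train a small skip-gram model; vector_size=2 just to mirror your setup
model = Word2Vec(tokenized, vector_size=2, window=2, min_count=1, sg=1, epochs=100)

# every word now has its own vector, addressable by the word itself
print(model.wv['cats'])
print(model.wv.most_similar('cats'))

With a real library you query vectors by word directly, so the "which row is which word" bookkeeping disappears.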
But since your sentences list looks like it has a few dozen unique words, an output where your full trained_embeddings contains a few dozen 2-dimensional vectors, one row per unique word, seems about right.
If you just need one word's vector, you need to look it up at whatever position in the full set that word was assigned before training; in your code that position is the integer stored for the word in dic.
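A minimal sketch of that lookup, reusing the dic, reverse_dictionary and trained_embeddings variables from your script (word_vector is just a hypothetical helper name):

# row i of trained_embeddings is the vector of the word whose id is i in dic
def word_vector(word):
    return trained_embeddings[dic[word]]

print(word_vector('are'))
print(word_vector('How'))  # note: your tokens keep their original capitalization

# or print every word next to its 2-dimensional vector
for idx, vec in enumerate(trained_embeddings):
    print(reverse_dictionary[idx], vec)

That last loop is essentially the output you expected: each row of trained_embeddings labelled with the word it belongs to.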
Upvotes: 1