Reputation: 233
I have recently tried to use word2vec. I trained my model and got vectors assigned to all the words, but I do not know how to find the vector value for each word.
I tried to print the model, but it only outputs all the vectors it has trained. I still don't get it: I thought the vectors are tied to individual words, but somehow everything is inside one list.
My understanding of word2vec is that each word (say W1) has its own vectors, and each vector represents the similarity between the current word (W1) and another word (W2). Since each word is assigned sparse vectors, there should be lots of vectors for W1 alone. However, when I print my model, I receive what looks like the output for only one word, and I am not sure which word it is. Can anyone please assist me?
My code:
import collections
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
batch_size = 20
embedding_size = 2
num_sampled = 15
sentences = ["I have something that I want to say to him",
"How are you",
"We can see many stars tonight",
"That's our house",
"sung likes cats",
"she loves dogs",
"Do you know what he has done",
"cats are great companions when they want to be",
"We need to invest in clean, renewable energy",
"women love his man",
"queen love his king",
"girl love his boy",
"The line is too long. Why don't you come back tomorrow",
"man and women roam in park",
"Does it really matter",
"dynasty king remain mortal"]
words = " ".join(sentences).split()
count = collections.Counter(words).most_common()
# Build dictionaries
reverse_dictionary = [i[0] for i in count] #reverse dic, idx -> word
dic = {w: i for i, w in enumerate(reverse_dictionary)} #dic, word -> id
voc_size = len(dic)
data = [dic[word] for word in words]
cbow_pairs = []
for i in range(1, len(data)-1):
    # pair the context words [previous, next] with the centre word
    cbow_pairs.append([[data[i-1], data[i+1]], data[i]])
skip_gram_pairs = []
for c in cbow_pairs:
    # (centre word, one context word) pairs for skip-gram training
    skip_gram_pairs.append([c[1], c[0][0]])
    skip_gram_pairs.append([c[1], c[0][1]])
def generate_batch(size):
    assert size < len(skip_gram_pairs)
    x_data = []
    y_data = []
    # sample `size` distinct (centre, context) pairs at random
    r = np.random.choice(range(len(skip_gram_pairs)), size, replace=False)
    for i in r:
        x_data.append(skip_gram_pairs[i][0])   # n dim
        y_data.append([skip_gram_pairs[i][1]]) # n, 1 dim
    return x_data, y_data
# Input data
train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
# Ops and variables pinned to the CPU because of missing GPU implementation
with tf.device('/cpu:0'):
    # Look up embeddings for inputs.
    embeddings = tf.Variable(
        tf.random_uniform([voc_size, embedding_size], -1.0, 1.0))
    embed = tf.nn.embedding_lookup(embeddings, train_inputs)  # lookup table
# Construct the variables for the NCE loss
nce_weights = tf.Variable(
    tf.random_uniform([voc_size, embedding_size], -1.0, 1.0))
nce_biases = tf.Variable(tf.zeros([voc_size]))
# Compute the average NCE loss for the batch.
# This does the magic:
# tf.nn.nce_loss(weights, biases, inputs, labels, num_sampled, num_classes ...)
# It automatically draws negative samples when we evaluate the loss.
loss = tf.reduce_mean(tf.nn.nce_loss(nce_weights, nce_biases, train_labels, embed, num_sampled, voc_size))
# Use the adam optimizer
train_op = tf.train.AdamOptimizer(1e-1).minimize(loss)
# Launch the graph in a session
with tf.Session() as sess:
    # Initializing all variables
    tf.global_variables_initializer().run()
    for step in range(100):
        batch_inputs, batch_labels = generate_batch(batch_size)
        _, loss_val = sess.run([train_op, loss],
                               feed_dict={train_inputs: batch_inputs, train_labels: batch_labels})
    # Final embeddings are ready for you to use. Need to normalize for practical use
    trained_embeddings = embeddings.eval()
print(trained_embeddings)
Current output: this output somehow seems to be for only a single word and not for all the words in the corpus.
[[-0.751498 -1.4963825 ]
[-0.7022982 -1.4211462 ]
[-1.6240289 -0.96706766]
[-3.2109795 -1.2967492 ]
[-0.8835893 -1.5251521 ]
[-1.4316636 -1.4322135 ]
[-1.8665589 -1.1734825 ]
[-0.4726948 -1.836668 ]
[-0.11171409 -2.0847342 ]
[-1.0599283 -0.9792351 ]
[-1.6748023 -0.9584413 ]
[-0.8855507 -1.3226773 ]
[-0.9565117 -1.5730425 ]
[-1.2891663 -1.1687953 ]
[-0.06940217 -1.7782353 ]
[-0.92220575 -1.8264929 ]
[-3.2258956 -1.105678 ]
[-2.4262347 -0.9806146 ]
[-0.36716968 -2.3782976 ]
[-0.4972397 -1.9926786 ]
[-0.65995616 -1.2129989 ]
[-0.53334516 -1.5244756 ]
[-1.4961753 -0.5592766 ]
[-0.57391864 -1.9852302 ]
[-0.6580112 -1.0749325 ]
[-0.7821078 -1.598069 ]
[-1.264001 -1.002861 ]
[-0.23881587 -2.103974 ]
[-0.3729657 -1.9456012 ]
[-0.9266953 -1.516872 ]
[-1.4948957 -1.1232641 ]
[-1.109361 -1.3108519 ]
[-2.0748782 -0.93853486]
[-2.0241299 -0.8716516 ]
[-0.9448593 -1.0530868 ]
[-1.4578291 -0.57673496]
[-0.31915158 -1.4830168 ]
[-1.2568909 -1.0629684 ]
[-0.50458056 -2.2233846 ]
[-1.2059065 -1.0402468 ]
[-0.17204402 -1.8913956 ]
[-1.5484996 -1.0246676 ]
[-1.7026784 -1.4470854 ]
[-2.114282 -1.2304462 ]
[-1.6737207 -1.2598573 ]
[-0.9031189 -1.8086503 ]
[-1.4084693 -0.9171761 ]
[-1.261698 -1.5333931 ]
[-2.7891722 -0.69629264]
[-2.7634912 -1.0250676 ]
[-2.171037 -1.3402877 ]
[-1.5588827 -1.4741637 ]
[-2.012083 -1.6028976 ]
[-1.4286829 -1.485801 ]
[-0.06908941 -2.370034 ]
[-1.3277153 -1.2935033 ]
[-0.52055264 -1.2549478 ]
[-2.4971442 -0.6335571 ]
[-2.7244987 -0.6136059 ]
[-0.7155211 -1.8717885 ]
[-2.1862056 -0.78832203]
[-2.068198 -0.96536046]
[-0.9023069 -1.6741301 ]
[-0.39895654 -1.584905 ]
[-0.656657 -1.6787726 ]
[ 0.13354267 -2.105389 ]
[-1.248123 -1.7273897 ]
[-0.6168909 -1.3929827 ]
[-0.1866242 -2.0612721 ]
[-2.3246803 -1.1561321 ]
[ 0.88145804 0.35487294]]
Example of expected output:
[-0.751498 -1.4963825 ] together with the word those two values belong to, for example "how" or "are".
Upvotes: 1
Views: 734
Reputation: 54243
If you've trained a Word2Vec model to learn 2-dimensional vectors per word, then each word will have its own 2-dimensional vector.
I can't evaluate your full implementation; you should probably be using a known-good, off-the-shelf Word2Vec library. Also, Word2Vec really depends on large, diverse training data: toy-sized examples won't usually show its real behaviors and benefits.
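For example, with gensim (a rough sketch, not a drop-in fix for your script: it assumes gensim 4.x, reuses your sentences list, and vector_size=2 only mirrors your tiny embedding_size):

from gensim.models import Word2Vec

# tokenize the same toy sentences
tokenized = [s.lower().split() for s in sentences]

# train a small skip-gram model; vector_size=2 just to mirror your setup
model = Word2Vec(tokenized, vector_size=2, window=2, min_count=1, sg=1, epochs=100)

# every word now has its own vector, addressable by the word itself
print(model.wv['cats'])
print(model.wv.most_similar('cats'))

With a real library you query vectors by word directly, so the "which row is which word" bookkeeping disappears.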
But since your sentences list looks like it has a few dozen unique words, an output where your full trained_embeddings contains a few dozen 2-dimensional vectors, one row per unique word, seems about right.
If you just need one word's vector, you need to look it up at whatever position in the full set that word was assigned before training; in your code that position is the integer stored for the word in dic.
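A minimal sketch of that lookup, reusing the dic, reverse_dictionary and trained_embeddings variables from your script (word_vector is just a hypothetical helper name):

# row i of trained_embeddings is the vector of the word whose id is i in dic
def word_vector(word):
    return trained_embeddings[dic[word]]

print(word_vector('are'))
print(word_vector('How'))  # note: your tokens keep their original capitalization

# or print every word next to its 2-dimensional vector
for idx, vec in enumerate(trained_embeddings):
    print(reverse_dictionary[idx], vec)

That last loop is essentially the output you expected: each row of trained_embeddings labelled with the word it belongs to.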
Upvotes: 1