blue-sky

Reputation: 53816

word2vec - get nearest words

Reading the TensorFlow word2vec model output, how can I output the words related to a specific word?

Reading the source: https://github.com/tensorflow/tensorflow/blob/r0.11/tensorflow/examples/tutorials/word2vec/word2vec_basic.py I can see how the image is plotted.

But is there a data structure (e.g. a dictionary) created as part of training the model that allows access to the n words nearest to a given word? For example, take the image word2vec generated:

[t-SNE visualization of word embeddings; image src: https://www.tensorflow.org/versions/r0.11/tutorials/word2vec/index.html]

In this image the words 'to', 'he', and 'it' are contained in the same cluster. Is there a function which takes 'to' as input and outputs 'he' and 'it' (in this case n=2)?

Upvotes: 12

Views: 20754

Answers (3)

Nate Raw

Reputation: 731

I will assume that you don't want to use gensim and would prefer to stick with TensorFlow. In that case, I'll offer two options:

Option 1 - Tensorboard:

If you are just trying to do this from an exploratory standpoint, I would suggest using TensorBoard's embedding visualizer to search for the closest embeddings. It provides a cool interface, and you can use both cosine and Euclidean distances with a set number of neighbors.

Tensorboard's embedding visualizer

Link to Tensorflow documentation
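If it helps, here is a rough sketch (not from the original answer) of wiring a trained embedding variable into the visualizer. It assumes a TensorFlow version that ships the projector plugin (around r0.12 and later), plus hypothetical names: an embeddings variable, a live sess, a log_dir for TensorBoard, and a metadata.tsv file listing one word per line in vocabulary order.

import os
import tensorflow as tf
from tensorflow.contrib.tensorboard.plugins import projector

# `embeddings`, `sess`, and `log_dir` are assumed to exist already.
writer = tf.summary.FileWriter(log_dir)
config = projector.ProjectorConfig()
emb = config.embeddings.add()
emb.tensor_name = embeddings.name
emb.metadata_path = 'metadata.tsv'  # one word per line, in vocabulary order
projector.visualize_embeddings(writer, config)

# The visualizer reads the variable's values from a checkpoint.
saver = tf.train.Saver([embeddings])
saver.save(sess, os.path.join(log_dir, 'model.ckpt'))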

Option 2 - Direct Calculation

Within the word2vec_basic.py file, there is an example of how they calculate the closest words; you could adapt it by modifying the function a little. The following is found in the graph itself:

# Compute the cosine similarity between minibatch examples and all embeddings.
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(
  normalized_embeddings, valid_dataset)
similarity = tf.matmul(
  valid_embeddings, normalized_embeddings, transpose_b=True)

Then, during training (every 10,000 steps) they run this next bit of code while the session is active. Calling similarity.eval() evaluates the similarity tensor in the graph to a literal NumPy array.

# Note that this is expensive (~20% slowdown if computed every 500 steps)
if step % 10000 == 0:
  sim = similarity.eval()
  for i in xrange(valid_size):
    valid_word = reverse_dictionary[valid_examples[i]]
    top_k = 8 # number of nearest neighbors
    nearest = (-sim[i, :]).argsort()[1:top_k+1]
    log_str = "Nearest to %s:" % valid_word
    for k in xrange(top_k):
      close_word = reverse_dictionary[nearest[k]]
      log_str = "%s %s," % (log_str, close_word)
    print(log_str)

If you want to adapt this for yourself, you will have to change reverse_dictionary[valid_examples[i]] to the word index (or indices) that you want the k closest words for.
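For instance, here is a minimal sketch of that adaptation (not from the original answer). It assumes the tutorial's final_embeddings NumPy array of row-normalized vectors, plus its dictionary and reverse_dictionary mappings:

import numpy as np

def nearest_words(word, final_embeddings, dictionary, reverse_dictionary, top_k=8):
    """Return the top_k words closest to `word` by cosine similarity."""
    idx = dictionary[word]  # word -> row index in the embedding matrix
    # Rows are unit-normalized, so a dot product gives cosine similarity.
    sim = np.dot(final_embeddings, final_embeddings[idx])
    nearest = (-sim).argsort()[1:top_k + 1]  # slot 0 is the word itself
    return [reverse_dictionary[i] for i in nearest]

# e.g. nearest_words('to', final_embeddings, dictionary, reverse_dictionary, top_k=2)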

Upvotes: 5

ngub05

Reputation: 596

Install gensim and use the similar_by_word method on a gensim.models.Word2Vec model.

similar_by_word takes 3 parameters:

  1. word - the input word
  2. topn - the number of top similar words to return (optional, default=10)
  3. restrict_vocab (optional, default=None)

Example

import gensim, nltk

class FileToSent(object):
    """A class to load a text file efficiently, one sentence per line."""

    def __init__(self, filename):
        self.filename = filename
        # To remove stop words (optional)
        self.stop = set(nltk.corpus.stopwords.words('english'))

    def __iter__(self):
        for line in open(self.filename, 'r'):
            ll = [i for i in unicode(line, 'utf-8').lower().split() if i not in self.stop]
            yield ll

Then depending on your input sentences (sentence_file.txt),

sentences = FileToSent('sentence_file.txt')
model = gensim.models.Word2Vec(sentences=sentences, min_count=2, hs=1)
print model.similar_by_word('hack', 2) # Get two most similar words to 'hack'
# [(u'debug', 0.967338502407074), (u'patch', 0.952264130115509)] (Output specific to my dataset)

Upvotes: 2

Steven Du

Reputation: 1691

This approach applies to word2vec in general: if you can save the word vectors in a text/binary file, like the Google/GloVe word vectors, then all you need is gensim.

To install:

Via GitHub

Python code:

from gensim.models import Word2Vec

# fname: path to vectors saved in the word2vec text/binary format
gmodel = Word2Vec.load_word2vec_format(fname)
ms = gmodel.most_similar('good', topn=10)  # topn must be passed by keyword
for x in ms:
    print x[0], x[1]

However, this will search all the words to produce the results. Approximate nearest neighbor (ANN) methods will give you the result faster, with a trade-off in accuracy.

In the latest gensim, annoy is used to perform the ANN search; see these notebooks for more information.
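A minimal sketch of that usage (assuming a gensim version that ships the AnnoyIndexer wrapper and the indexer keyword on most_similar, with the annoy package installed; fname is the vector file as above):

from gensim.models import Word2Vec
from gensim.similarities.index import AnnoyIndexer

gmodel = Word2Vec.load_word2vec_format(fname)
# 100 trees: more trees give better accuracy at the cost of a slower build
indexer = AnnoyIndexer(gmodel, 100)

# Passing the indexer makes most_similar use the approximate index
ms = gmodel.most_similar('good', topn=10, indexer=indexer)
for word, score in ms:
    print word, score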

FLANN is another library for approximate nearest neighbors.

Upvotes: 14
