Matt

Reputation: 53

Extract word/sentence probabilities from lm_1b trained model

I have successfully downloaded the 1B word language model trained using a CNN-LSTM (https://github.com/tensorflow/models/tree/master/research/lm_1b), and I would like to be able to input sentences or partial sentences to get the probability of each subsequent word in the sentence.

For example, given the partial sentence "An animal that says", I'd like to know the probability of the next word being "woof" vs. "meow".

I understand that running the following produces the LSTM embeddings:

bazel-bin/lm_1b/lm_1b_eval --mode dump_lstm_emb \
                           --pbtxt data/graph-2016-09-10.pbtxt \
                           --vocab_file data/vocab-2016-09-10.txt \
                           --ckpt 'data/ckpt-*' \
                           --sentence "An animal that says woof" \
                           --save_dir output

That produces files lstm_emb_step_*.npy, one per word in the sentence, each containing the LSTM embedding for that word. How can I transform these into probabilities over the trained model so I can compare P(woof|An animal that says) vs. P(meow|An animal that says)?
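For reference, here is a minimal sketch of loading those dumped files back in (assuming the `output` directory from the command above and plain NumPy; nothing model-specific):

    import glob
    import numpy as np

    # One .npy file per word in --sentence, written to the --save_dir above.
    paths = sorted(glob.glob("output/lstm_emb_step_*.npy"))
    embeddings = [np.load(p) for p in paths]
    for path, emb in zip(paths, embeddings):
        print(path, emb.shape)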

Thanks in advance.

Upvotes: 4

Views: 499

Answers (1)

cheshirekow

Reputation: 4907

I wanted to do the same thing, and this is what I came up with, adapted from some of their demo code. I'm not entirely sure it is correct, but it seems to produce reasonable values.

import numpy as np

# Same values as the constants in lm_1b_eval.py.
BATCH_SIZE = 1
NUM_TIMESTEPS = 1


def get_probability_of_next_word(sess, t, vocab, prefix_words, query):
  """
  Return the probability of the given word based on the sequence of prefix
  words.

  :param sess: TensorFlow session object
  :param t: dict of named input/output tensors, as returned by LoadModel
  :param vocab: Vocabulary model, maps id <-> string, stores max word char id length
  :param list prefix_words: List of words that appear before this one.
  :param str query: The query word
  """
  targets = np.zeros([BATCH_SIZE, NUM_TIMESTEPS], np.int32)
  weights = np.ones([BATCH_SIZE, NUM_TIMESTEPS], np.float32)

  if not prefix_words or prefix_words[0] != "<S>":
    prefix_words.insert(0, "<S>")

  prefix = [vocab.word_to_id(w) for w in prefix_words]
  prefix_char_ids = [vocab.word_to_char_ids(w) for w in prefix_words]

  inputs = np.zeros([BATCH_SIZE, NUM_TIMESTEPS], np.int32)
  char_ids_inputs = np.zeros(
      [BATCH_SIZE, NUM_TIMESTEPS, vocab.max_word_length], np.int32)

  # Feed the prefix one word per step. The LSTM state is held in graph
  # variables, so each sess.run() continues from where the previous one
  # left off; the softmax from the last step is the distribution over the
  # next word.
  for i in range(len(prefix)):
    inputs[0, 0] = prefix[i]
    char_ids_inputs[0, 0, :] = prefix_char_ids[i]
    softmax = sess.run(t['softmax_out'],
                       feed_dict={t['char_inputs_in']: char_ids_inputs,
                                  t['inputs_in']: inputs,
                                  t['targets_in']: targets,
                                  t['target_weights_in']: weights})

  return softmax[0, vocab.word_to_id(query)]

Example usage

vocab = CharsVocabulary(vocab_path, MAX_WORD_LEN)
sess, t = LoadModel(model_path, ckptdir + "/ckpt-*")
result = get_probability_of_next_word(sess, t, vocab, ["Hello", "my", "friend"], "for")

gives a result of 8.811023e-05. Note that CharsVocabulary and LoadModel are very slightly adapted from the ones in the repo.

Also note that this function is very slow. Maybe someone knows how to improve it.
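One possible speed-up, sketched under the same assumptions (not verified against the model, and get_probabilities_of_next_words is a hypothetical helper, not part of the repo): since t['softmax_out'] is a distribution over the entire vocabulary, you can feed the prefix once and look up as many candidate next words as you like from that single softmax, instead of re-running the prefix for every candidate.

def get_probabilities_of_next_words(sess, t, vocab, prefix_words, queries):
  """Score several candidate next words from a single pass over the prefix."""
  targets = np.zeros([BATCH_SIZE, NUM_TIMESTEPS], np.int32)
  weights = np.ones([BATCH_SIZE, NUM_TIMESTEPS], np.float32)

  if not prefix_words or prefix_words[0] != "<S>":
    prefix_words = ["<S>"] + list(prefix_words)

  inputs = np.zeros([BATCH_SIZE, NUM_TIMESTEPS], np.int32)
  char_ids_inputs = np.zeros(
      [BATCH_SIZE, NUM_TIMESTEPS, vocab.max_word_length], np.int32)

  # If your LoadModel exposes the graph's state-reset op (the demo code calls
  # it 'states_init'), running it here keeps calls independent of each other:
  # sess.run(t['states_init'])

  # Feed the prefix once; the softmax after the last word covers the whole
  # vocabulary, so every candidate can be read off the same distribution.
  for w in prefix_words:
    inputs[0, 0] = vocab.word_to_id(w)
    char_ids_inputs[0, 0, :] = vocab.word_to_char_ids(w)
    softmax = sess.run(t['softmax_out'],
                       feed_dict={t['char_inputs_in']: char_ids_inputs,
                                  t['inputs_in']: inputs,
                                  t['targets_in']: targets,
                                  t['target_weights_in']: weights})

  return {q: softmax[0, vocab.word_to_id(q)] for q in queries}

With the question's example this would be something like get_probabilities_of_next_words(sess, t, vocab, ["An", "animal", "that", "says"], ["woof", "meow"]), so the woof/meow comparison costs one forward pass over the prefix instead of two.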

Upvotes: 0
