Hew

Reputation: 33

Distance between words in a TensorFlow embedding

I'd like to use one of the models on TensorFlow Hub to look at the distances between words (specifically this one: https://tfhub.dev/google/nnlm-en-dim128/1). But I can't find a good example of how to measure the distance between two words or two groups of words... is this even possible with an embedding like this?

I'm 100% not a data scientist, so this might just be a complete lack of understanding on my part; apologies if it's a dumb question.

Ideally I'd like to look at the distance of a single word compared to two different sets of words.

Upvotes: 1

Views: 2139

Answers (1)

Stewart_R

Reputation: 14485

I think the most common measure of distance between two embedded vectors is the cosine similarity.

We can calculate the cosine similarity using the formula:

$$\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$$

which we can translate into tensorflow code as follows:

def cosine_similarity(a, b):
  # Magnitudes (L2 norms) of each vector.
  mag_a = tf.sqrt(tf.reduce_sum(tf.multiply(a, a)))
  mag_b = tf.sqrt(tf.reduce_sum(tf.multiply(b, b)))
  # Dot product divided by the product of the magnitudes.
  return tf.reduce_sum(tf.multiply(a, b)) / (mag_a * mag_b)

so we have a complete example as follows:

import tensorflow as tf
import tensorflow_hub as hub

embed = hub.Module("https://tfhub.dev/google/nnlm-en-dim128/1")
# Each input string is embedded as a single 128-dimensional vector.
embeddings = embed(["cat is on the mat", "tiger sat on the mat"])

def cosine_similarity(a, b):
  mag_a = tf.sqrt(tf.reduce_sum(tf.multiply(a, a)))
  mag_b = tf.sqrt(tf.reduce_sum(tf.multiply(b, b)))
  return tf.reduce_sum(tf.multiply(a, b)) / (mag_a * mag_b)

a = embeddings[0]
b = embeddings[1]

cos_similarity = cosine_similarity(a, b)

with tf.Session() as sess:
  # hub.Module adds lookup-table and variable ops that must be
  # initialised before running the graph. (initialize_all_tables
  # is deprecated; tables_initializer is its replacement.)
  sess.run(tf.tables_initializer())
  sess.run(tf.global_variables_initializer())

  print(sess.run(cos_similarity))

which outputs 0.78157.
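As for comparing a single word against two different sets of words: one simple approach (just one option, not the only one) is to average each group's embedding vectors into a single vector and compare the word against each average. Here's a minimal sketch reusing the embed module and cosine_similarity function from above, with made-up word lists:

word = embed(["cat"])
group_a = embed(["dog", "tiger", "lion"])
group_b = embed(["table", "chair", "sofa"])

# Average each group down to a single 128-dimensional vector.
mean_a = tf.reduce_mean(group_a, axis=0)
mean_b = tf.reduce_mean(group_b, axis=0)

sim_to_a = cosine_similarity(word[0], mean_a)
sim_to_b = cosine_similarity(word[0], mean_b)

with tf.Session() as sess:
  sess.run(tf.tables_initializer())
  sess.run(tf.global_variables_initializer())
  print(sess.run([sim_to_a, sim_to_b]))

The higher of the two similarities tells you which group the word sits closer to in the embedding space.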

Note that some folks advocate a rearrangement of the formula which gives the same results (give or take minuscule rounding errors) and may or may not be slightly better optimised.

That alternative formulation looks like this:

def cosine_similarity(a, b):
  # l2_normalize scales each vector to unit length, so the dot
  # product of the normalised vectors is the cosine similarity.
  norm_a = tf.nn.l2_normalize(a, 0)
  norm_b = tf.nn.l2_normalize(b, 0)
  return tf.reduce_sum(tf.multiply(norm_a, norm_b))

Personally, I can't see how the difference could be anything other than negligible, and I happen to know the first formulation, so I tend to stick with it; but I certainly make no claim that it's best, and I don't claim to know which is fastest! :-)
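As an aside: if you're on TensorFlow 2.x, eager execution removes the session boilerplate entirely. A rough equivalent, assuming the TF2-compatible "/2" release of the same module:

import tensorflow as tf
import tensorflow_hub as hub

# The "/2" version of the module is the TF2-compatible release.
embed = hub.load("https://tfhub.dev/google/nnlm-en-dim128/2")
embeddings = embed(["cat is on the mat", "tiger sat on the mat"])

# Normalise to unit length, then take the dot product, exactly as above.
norm = tf.nn.l2_normalize(embeddings, axis=1)
print(tf.reduce_sum(norm[0] * norm[1]).numpy())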

Upvotes: 1
