Reputation: 33
I'd like to use one of the models on TensorFlow Hub to look at the distances between words (specifically this one https://tfhub.dev/google/nnlm-en-dim128/1). But I can't find a good example of how to find the distance between two words or two groups of words... is this something that is possible with an embedding like this?
I'm 100% not a Data Scientist and so this might be a complete lack of understanding so apologies if it's a dumb question.
Ideally I'd like to look at the distance of a single word compared to two different sets of words.
Upvotes: 1
Views: 2139
Reputation: 14485
I think the most common measure of distance between two embedded vectors is the cosine similarity.
We can calculate the cosine similarity using the formula:

cos(θ) = (a · b) / (‖a‖ ‖b‖)

which we can translate into TensorFlow code as follows:
def cosine_similarity(a, b):
    mag_a = tf.sqrt(tf.reduce_sum(tf.multiply(a, a)))
    mag_b = tf.sqrt(tf.reduce_sum(tf.multiply(b, b)))
    return tf.reduce_sum(tf.multiply(a, b)) / (mag_a * mag_b)
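As a quick sanity check on the arithmetic, here is the same formula in plain NumPy with small hypothetical 3-dimensional vectors (not the 128-dimensional hub embeddings): parallel vectors should score 1 and anti-parallel vectors should score -1.

```python
import numpy as np

def cosine_similarity(a, b):
    # Same formula: dot product divided by the product of the magnitudes.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])    # parallel to a, so similarity ≈ 1
c = np.array([-1.0, -2.0, -3.0]) # anti-parallel to a, so similarity ≈ -1

print(cosine_similarity(a, b))  # ≈ 1.0
print(cosine_similarity(a, c))  # ≈ -1.0
```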
so we have a complete example as follows:
import tensorflow as tf
import tensorflow_hub as hub

embed = hub.Module("https://tfhub.dev/google/nnlm-en-dim128/1")
embeddings = embed(["cat is on the mat", "tiger sat on the mat"])

def cosine_similarity(a, b):
    mag_a = tf.sqrt(tf.reduce_sum(tf.multiply(a, a)))
    mag_b = tf.sqrt(tf.reduce_sum(tf.multiply(b, b)))
    return tf.reduce_sum(tf.multiply(a, b)) / (mag_a * mag_b)

a = embeddings[0]
b = embeddings[1]
cos_similarity = cosine_similarity(a, b)

with tf.Session() as sess:
    sess.run(tf.tables_initializer())
    sess.run(tf.global_variables_initializer())
    print(sess.run(cos_similarity))
which outputs 0.78157.
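To get at the original question (one word compared against two different sets of words), one common approach is to average each set's embeddings into a centroid and compare the word's vector against each centroid with the same cosine similarity. The sketch below uses made-up NumPy vectors as stand-ins for the hub embeddings; with the real module you would first run the words through embed([...]) and fetch the resulting vectors.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical embeddings standing in for embed(["..."]) output.
word_vec = np.array([1.0, 0.0, 1.0])
set_a = np.array([[1.0, 0.1, 0.9],
                  [0.8, 0.0, 1.1]])    # set the word should be close to
set_b = np.array([[-1.0, 0.5, -0.8],
                  [-0.9, 0.4, -1.0]])  # set the word should be far from

# Average each set into a single "centroid" vector.
centroid_a = set_a.mean(axis=0)
centroid_b = set_b.mean(axis=0)

sim_a = cosine_similarity(word_vec, centroid_a)
sim_b = cosine_similarity(word_vec, centroid_b)
print(sim_a > sim_b)  # True: the word is closer to set A
```

An alternative to the centroid is to average the word's cosine similarity against each member of the set individually; which works better depends on how tightly clustered the sets are.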
Note that some folks advocate a rearrangement of the formula which gives the same results (up to minuscule rounding errors) and may or may not be slightly better optimised.
This alternative formula is calculated as:
def cosine_similarity(a, b):
    norm_a = tf.nn.l2_normalize(a, 0)
    norm_b = tf.nn.l2_normalize(b, 0)
    return tf.reduce_sum(tf.multiply(norm_a, norm_b))
Personally, I can't see how the difference could be anything other than negligible, and since I happen to know the first formulation I tend to stick with it, but I certainly make no claim that it's best, and I don't claim to know which is fastest! :-)
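For what it's worth, a quick NumPy check with arbitrary random vectors (not the hub embeddings) confirms the two formulations agree up to floating-point rounding:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(128)
b = rng.standard_normal(128)

# Direct formula: dot product over product of magnitudes.
direct = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# Rearranged formula: normalize first, then take the dot product.
normalized = np.dot(a / np.linalg.norm(a), b / np.linalg.norm(b))

print(abs(direct - normalized))  # difference is floating-point rounding only
```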
Upvotes: 1