rd11

Reputation: 3084

Is this one-hot encoding in TensorFlow fast? Or flawed for any reason?

There are a few Stack Overflow questions about computing one-hot embeddings with TensorFlow, and here is the accepted solution:

import tensorflow as tf

num_labels = 10
# label_batch is assumed to be a 1-D tensor of integer class labels.
sparse_labels = tf.reshape(label_batch, [-1, 1])
derived_size = tf.shape(label_batch)[0]
# Pair each row index with its label: [[0, l0], [1, l1], ...].
indices = tf.reshape(tf.range(0, derived_size, 1), [-1, 1])
concated = tf.concat(1, [indices, sparse_labels])
# derived_size is a scalar, so reshape it to rank 1 before concatenating.
outshape = tf.reshape(
    tf.concat(0, [tf.reshape(derived_size, [1]), [num_labels]]), [-1])
labels = tf.sparse_to_dense(concated, outshape, 1.0, 0.0)

This is almost identical to the code in an official tutorial: https://www.tensorflow.org/versions/0.6.0/tutorials/mnist/tf/index.html

To me it seems that, since tf.nn.embedding_lookup exists, using it would probably be more efficient. Here's a version that uses it, and it supports arbitrarily-shaped inputs:

import numpy as np
import tensorflow as tf

def one_hot(inputs, num_classes):
    # Keep the table on the CPU: for many classes it can be large.
    with tf.device('/cpu:0'):
        # num_classes x num_classes identity matrix: row i is the
        # one-hot vector for class i.
        table = tf.constant(np.identity(num_classes, dtype=np.float32))
        # Row lookup works for arbitrarily-shaped integer inputs and
        # appends one dimension of size num_classes to the result.
        embeddings = tf.nn.embedding_lookup(table, inputs)
    return embeddings
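
For example, with a 2 x 2 batch of made-up labels:

batch = tf.constant([[1, 3], [0, 2]])      # shape [2, 2]
encoded = one_hot(batch, num_classes=4)    # shape [2, 2, 4]
# e.g. encoded[0, 1] evaluates to [0., 0., 0., 1.]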

Do you expect this implementation to be faster? And is it flawed for any other reason?

Upvotes: 3

Views: 8632

Answers (1)

mrry

Reputation: 126184

The one_hot() function in your question looks correct. However, we do not recommend writing code this way because it is very memory-inefficient. To understand why, say you have a batch size of 32 and 1,000,000 classes.

  • In the version suggested in the tutorial, the largest tensor will be the result of tf.sparse_to_dense(), a 32 x 1,000,000 matrix (128 MB as float32).

  • In the one_hot() function in the question, the largest tensor will be the result of np.identity(1000000), a 1,000,000 x 1,000,000 float32 matrix, which is 4 terabytes (see the quick check below). Of course, allocating this tensor probably won't succeed. Even if the number of classes were much smaller, it would still waste memory to store all of those zeroes explicitly: TensorFlow does not automatically convert your data to a sparse representation, even when it might be profitable to do so.
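
A quick back-of-the-envelope check of those sizes, in plain Python (float32 is 4 bytes per element):

batch_size, num_classes = 32, 1000000
bytes_per_float = 4

# Tutorial version: one dense [batch_size, num_classes] tensor.
print(batch_size * num_classes * bytes_per_float / 1e6)    # 128.0 (MB)

# one_hot() version: a [num_classes, num_classes] identity table.
print(num_classes * num_classes * bytes_per_float / 1e12)  # 4.0 (TB)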

Finally, I want to offer a plug for a new function that was recently added to the open-source repository and will be available in the next release: tf.nn.sparse_softmax_cross_entropy_with_logits() allows you to specify a vector of integers as the labels, and saves you from having to build the dense one-hot representation. It should be much more efficient than either solution for large numbers of classes.
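
A minimal sketch of how it would be used (the logits and labels here are made up, and note that the function's argument order has changed in later TensorFlow releases):

import tensorflow as tf

# Hypothetical unnormalized scores for a batch of 3 examples over 5 classes.
logits = tf.constant([[2.0, -1.0, 0.5, 0.0, 0.3],
                      [0.1, 3.0, 0.2, 0.0, 0.0],
                      [-0.5, 0.0, 0.0, 1.5, 0.2]])
# Plain integer class IDs, shape [batch_size]; no one-hot tensor is built.
labels = tf.constant([0, 1, 3], dtype=tf.int64)
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(logits, labels))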

Upvotes: 24
