Harrison

Reputation: 2670

How to use embedding_lookup with a batch of different length sequences in Tensorflow?

Say I have an embedding tensor:

emb = [[1,1],
       [2,2],
       [3,3],
       [4,4]]
emb = tf.constant(emb)

I have a list of sequences:

inputs = [[0,1,2,3],
          [3,2]]

I'd like to look up these ids in emb and pad the result with zeros so that every sequence has the same length:

  [[[1, 1],
    [2, 2],
    [3, 3],
    [4, 4]],

   [[4, 4],
    [3, 3],
    [0, 0],
    [0, 0]]]

I tried tf.nn.embedding_lookup, but got an error:

ValueError: Argument must be a dense tensor: [[0, 1, 2, 3], [3, 2]] - got shape [2], but wanted [2, 4].

Is it possible to achieve this without prepending [0, 0] to emb?

Upvotes: 2

Views: 2068

Answers (1)

mrry

Reputation: 126154

The tf.nn.embedding_lookup(params, ids) function only accepts dense, rectangular tensors as the ids argument. (In general, the same goes for all TensorFlow operators that expect a tf.Tensor or tensor-like argument such as a NumPy array.)
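
One way to meet the rectangular requirement and still get zeros in the padded slots, without prepending [0, 0] to emb, is to pad the ids with any valid id and then mask the looked-up rows. The following is only a minimal sketch: pad_ids and lookup_padded are hypothetical helper names, padding with id 0 is an assumption, and TensorFlow is imported lazily inside the second function so the padding step can be tried on its own.

```python
def pad_ids(sequences, pad_id=0):
    """Pad ragged id lists into a rectangle; return (padded, lengths)."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


def lookup_padded(emb, sequences):
    """Look up embeddings, then zero out the rows that came from padding."""
    import tensorflow as tf  # lazy import: only needed for the lookup itself

    padded, lengths = pad_ids(sequences)
    looked_up = tf.nn.embedding_lookup(tf.constant(emb), tf.constant(padded))
    # mask[i, j] is 1 for real positions and 0 for padding.
    mask = tf.sequence_mask(lengths, maxlen=len(padded[0]), dtype=looked_up.dtype)
    # Broadcast the mask over the embedding dimension.
    return looked_up * tf.expand_dims(mask, -1)


padded, lengths = pad_ids([[0, 1, 2, 3], [3, 2]])
# padded  == [[0, 1, 2, 3], [3, 2, 0, 0]]
# lengths == [4, 2]
```

Because the padded positions are multiplied by zero, it does not matter which id is used for padding, as long as it is a valid row of emb.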

For sparse data, you can use tf.nn.embedding_lookup_sparse(), which takes a tf.SparseTensor of ids and can therefore represent sequences of different lengths. A tf.SparseTensor is defined by three separate (dense) tensors: the indices of the non-zero entries, the values of those entries, and the overall dense shape. For your example inputs, the representation would be:

inputs_sparse = tf.SparseTensor(
    # The coordinates of the non-zero entries (must be int64).
    indices=tf.constant([[0, 0], [0, 1], [0, 2], [0, 3],
                         [1, 0], [1, 1]], dtype=tf.int64),
    # The values of the respective non-zero entries (int64 ids).
    values=tf.constant([0, 1, 2, 3,
                        3, 2], dtype=tf.int64),
    # The shape of the corresponding dense tensor (must be >= [2, 4]).
    # This argument is named dense_shape in TF 1.0+; older releases called it shape.
    dense_shape=[2, 4],
)
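
The three components do not have to be typed by hand; they can be derived from the ragged Python list. This is a rough sketch in plain Python, where to_sparse_components is a hypothetical helper name, not a TensorFlow function:

```python
def to_sparse_components(sequences):
    """Build (indices, values, dense_shape) for a list of id sequences."""
    # One [row, col] coordinate per id actually present.
    indices = [[row, col]
               for row, seq in enumerate(sequences)
               for col in range(len(seq))]
    # The ids themselves, in the same row-major order as the coordinates.
    values = [v for seq in sequences for v in seq]
    # Smallest dense shape that holds every sequence.
    dense_shape = [len(sequences), max(len(seq) for seq in sequences)]
    return indices, values, dense_shape


indices, values, dense_shape = to_sparse_components([[0, 1, 2, 3], [3, 2]])
# indices     == [[0, 0], [0, 1], [0, 2], [0, 3], [1, 0], [1, 1]]
# values      == [0, 1, 2, 3, 3, 2]
# dense_shape == [2, 4]
```

These lists can then be passed to tf.SparseTensor (as int64). Note that tf.nn.embedding_lookup_sparse() combines the embeddings within each row according to its combiner argument, so it returns one vector per sequence rather than the zero-padded per-position output shown in the question.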

Upvotes: 2
