Celso França

Reputation: 734

Padding and Masking a batch dataset

When representing multiple strings of natural language, the strings usually differ in length. The result can then be placed in a tf.RaggedTensor, where the length of the innermost dimension varies with the number of characters in each string:

rtensor = tf.ragged.constant([
                      [1, 2], 
                      [3, 4, 5],
                      [6]
                      ])
rtensor
#<tf.RaggedTensor [[1, 2], [3, 4, 5], [6]]>

In turn, applying the to_tensor method converts that RaggedTensor into a regular tf.Tensor and, consequently, applies the padding operation:

batch_size = 3
max_length = 8
tensor = rtensor.to_tensor(default_value=0, shape=(batch_size, max_length))
tensor
#<tf.Tensor: shape=(3, 8), dtype=int32, numpy=
#array([[1, 2, 0, 0, 0, 0, 0, 0],
#       [3, 4, 5, 0, 0, 0, 0, 0],
#       [6, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>

Now, is there a way to also generate a companion tensor indicating which entries are original data and which are padding? For the example above, it would be:

<tf.Tensor: shape=(3, 8), dtype=int32, numpy=
array([[1, 1, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>

Upvotes: 2

Views: 830

Answers (1)

javidcf

Reputation: 59691

As thusv89 suggests, you can simply check for non-zero values. This can be as simple as converting to boolean and back:

import tensorflow as tf

rtensor = tf.ragged.constant([[1, 2],
                              [3, 4, 5],
                              [6]])
batch_size = 3
max_length = 8
tensor = rtensor.to_tensor(default_value=0, shape=(batch_size, max_length))
mask = tf.dtypes.cast(tf.dtypes.cast(tensor, tf.bool), tensor.dtype)
print(mask.numpy())
# [[1 1 0 0 0 0 0 0]
#  [1 1 1 0 0 0 0 0]
#  [1 0 0 0 0 0 0 0]]

The only possible drawback is that your data might originally contain 0 values. If you know your data will always be non-negative, you can use some other default value, such as -1, when converting to a tensor:

tensor = rtensor.to_tensor(default_value=-1, shape=(batch_size, max_length))
mask = tf.dtypes.cast(tensor >= 0, tensor.dtype)
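To see why this helps, here is a minimal sketch with hypothetical data that contains a legitimate 0 value; padding with -1 keeps it distinguishable from the padded positions:

```python
import tensorflow as tf

# Hypothetical data: the first row starts with a real 0 value
rtensor = tf.ragged.constant([[0, 2],
                              [3, 4, 5],
                              [6]])
tensor = rtensor.to_tensor(default_value=-1, shape=(3, 8))
mask = tf.dtypes.cast(tensor >= 0, tensor.dtype)
print(mask.numpy())
# [[1 1 0 0 0 0 0 0]
#  [1 1 1 0 0 0 0 0]
#  [1 0 0 0 0 0 0 0]]
```

If you still want the final tensor padded with 0 rather than -1, you can recover it afterwards with `tensor * mask`.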

But if you want your mask to work for whatever values you have, you can also just use tf.ones_like with the ragged tensor:

rtensor_ones = tf.ones_like(rtensor)
mask = rtensor_ones.to_tensor(default_value=0, shape=(batch_size, max_length))

This way mask will always be one exactly where rtensor has a value.
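As a quick check, a sketch with hypothetical data mixing zeros and negative values shows the mask is driven purely by the ragged structure, not by the values:

```python
import tensorflow as tf

# Hypothetical data with zeros and negatives in the valid positions
rtensor = tf.ragged.constant([[0, -2],
                              [3, 0, 5],
                              [6]])
mask = tf.ones_like(rtensor).to_tensor(default_value=0, shape=(3, 8))
print(mask.numpy())
# [[1 1 0 0 0 0 0 0]
#  [1 1 1 0 0 0 0 0]
#  [1 0 0 0 0 0 0 0]]
```

tf.ones_like dispatches on RaggedTensor inputs, so the ones are created per row before padding is applied.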

Upvotes: 2
