Aaron Lockhart

Reputation: 69

TensorFlow Dataset.bucket_by_sequence_length throws TypeError

I'm trying to construct a dataset of variable length English/Japanese sentences for a machine translation problem, but I can't get the Dataset.bucket_by_sequence_length function to work. It throws the following:

TypeError: Tensor.__init__() missing 3 required positional arguments: 'op', 
'value_index', and 'dtype'

Despite my best efforts I have not been able to diagnose the problem. I've tried using a named function for element_length_func, passing a single dataset entry, passing only the English sentences, and various manual and dynamically generated values for bucket_boundaries and bucket_batch_sizes. The code for constructing the dataset from a list of lists of varying length containing integer indexes is included below. Any suggestions or possible solutions?

# Create initial dataset
eng, jap = map(list, zip(*data))
assert len(eng) == len(jap)
eng = tf.ragged.constant(eng, dtype=tf.uint16)
jap = tf.ragged.constant(jap, dtype=tf.uint16)

dataset = tf.data.Dataset.from_tensor_slices((eng, jap))

# Bucket based on sequence length
vocab = tokenizer.get_vocab()
dataset = dataset.bucket_by_sequence_length(
    element_length_func=lambda x, _=None: tf.shape(x)[0],
    bucket_boundaries=[100],
    bucket_batch_sizes=[BATCH_SIZE, BATCH_SIZE],
    padding_values=vocab["[PAD]"],
)

Upvotes: 1

Views: 153

Answers (1)

vasiliykarasev

Reputation: 871

It doesn't seem like bucket_by_sequence_length() (or more precisely, PaddedBatchDataset) supports ragged tensor inputs.

(You can check that in your case dataset.element_spec consists of tf.RaggedTensorSpec, while PaddedBatchDataset expects tf.TensorSpec.)
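One quick way to confirm this is to inspect element_spec directly. A minimal sketch with toy sequences in place of the question's tokenized corpus:

```python
import tensorflow as tf

# Toy ragged input standing in for the tokenized sentences.
ragged = tf.ragged.constant([[1, 2, 3], [4, 5]], dtype=tf.uint16)
dataset = tf.data.Dataset.from_tensor_slices(ragged)

# Elements are described by a RaggedTensorSpec rather than a TensorSpec,
# which is what the padded-batching step behind
# bucket_by_sequence_length trips over.
print(dataset.element_spec)
```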

Would it instead work to pair the input data like this:

eng, jap = # some iterables (and *not* tensors)
dataset = tf.data.Dataset.zip((
    tf.data.Dataset.from_generator(lambda: eng, tf.uint16, output_shapes=[None]),
    tf.data.Dataset.from_generator(lambda: jap, tf.uint16, output_shapes=[None]),
))

and then to apply tf.data.experimental.bucket_by_sequence_length() on the result?
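Put together, an end-to-end sketch of this suggestion might look like the following. Note the toy sequences, tf.int32 dtype, and PAD id of 0 are stand-ins for the question's corpus, tf.uint16 tensors, and vocab["[PAD]"]:

```python
import tensorflow as tf

BATCH_SIZE = 2
PAD_ID = 0  # stand-in for vocab["[PAD]"]

# Toy variable-length index sequences in place of the real corpus.
eng = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
jap = [[10, 11], [12, 13, 14], [15]]

dataset = tf.data.Dataset.zip((
    tf.data.Dataset.from_generator(lambda: eng, tf.int32, output_shapes=[None]),
    tf.data.Dataset.from_generator(lambda: jap, tf.int32, output_shapes=[None]),
))

# Each element is now a pair of plain dense, variable-length tensors,
# so the bucketing transform can pad and batch them.
dataset = dataset.apply(
    tf.data.experimental.bucket_by_sequence_length(
        element_length_func=lambda e, j: tf.shape(e)[0],
        bucket_boundaries=[100],
        bucket_batch_sizes=[BATCH_SIZE, BATCH_SIZE],
        padding_values=(PAD_ID, PAD_ID),
    )
)

for eng_batch, jap_batch in dataset:
    # Each batch is padded to the longest sequence it contains.
    print(eng_batch.shape, jap_batch.shape)
```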

Upvotes: 1
