Reputation: 69
I'm trying to construct a dataset of variable-length English/Japanese sentence pairs for a machine translation problem, but I can't get the `Dataset.bucket_by_sequence_length` function to work. It throws the following:
TypeError: Tensor.__init__() missing 3 required positional arguments: 'op',
'value_index', and 'dtype'
Despite my best efforts I have not been able to diagnose the problem. I've tried using a named function for `element_length_func`, passing a single dataset entry, passing only the English sentences, and trying various manual and dynamically generated values for `bucket_boundaries` and `bucket_batch_sizes`. The code for constructing the dataset from a list of lists of varying length containing integer indexes is included below. Any suggestions or possible solutions?
# Create initial dataset
eng, jap = map(list, zip(*data))
assert len(eng) == len(jap)
eng = tf.ragged.constant(eng, dtype=tf.uint16)
jap = tf.ragged.constant(jap, dtype=tf.uint16)
dataset = tf.data.Dataset.from_tensor_slices((eng, jap))
# Bucket based on sequence length
vocab = tokenizer.get_vocab()
dataset = dataset.bucket_by_sequence_length(
    element_length_func=lambda x, _=None: tf.shape(x)[0],
    bucket_boundaries=[100],
    bucket_batch_sizes=[BATCH_SIZE, BATCH_SIZE],
    padding_values=vocab["[PAD]"],
)
Upvotes: 1
Views: 153
Reputation: 871
It doesn't seem like `bucket_by_sequence_length()` (or, more precisely, `PaddedBatchDataset`) supports ragged tensor inputs.
(You can check that in your case `dataset.element_spec` consists of `tf.RaggedTensorSpec`, while `PaddedBatchDataset` wants `tf.TensorSpec`: link)
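You can reproduce the mismatch with a toy dataset (the token lists below are made-up stand-ins for your tokenized data):

```python
import tensorflow as tf

# Building a dataset from a ragged constant yields RaggedTensorSpec
# elements, which the padded-batching machinery cannot handle.
ragged = tf.ragged.constant([[1, 2, 3], [4]], dtype=tf.int32)
ds = tf.data.Dataset.from_tensor_slices(ragged)
print(ds.element_spec)  # a tf.RaggedTensorSpec, not a tf.TensorSpec
```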
Is it meaningful to pair input data like this instead:
eng, jap = ...  # some iterables (and *not* tensors)
dataset = tf.data.Dataset.zip((
    tf.data.Dataset.from_generator(lambda: eng, tf.uint16, output_shapes=[None]),
    tf.data.Dataset.from_generator(lambda: jap, tf.uint16, output_shapes=[None])))
and then apply `tf.data.experimental.bucket_by_sequence_length()` to the result?
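For reference, a minimal end-to-end sketch of that approach. The token lists, `PAD_ID`, bucket boundary, and batch size below are made-up stand-ins, and I'm using the `Dataset.bucket_by_sequence_length` method (available since TF 2.6) rather than the experimental function:

```python
import tensorflow as tf

# Dummy stand-ins for tokenized sentence pairs of varying length.
eng = [[2, 5, 9], [2, 5, 9, 11, 3], [2, 7]]
jap = [[4, 8], [4, 8, 12, 6], [4, 10, 13]]
PAD_ID = 0       # assumed [PAD] index
BATCH_SIZE = 2

# from_generator yields dense 1-D tensors, so element_spec is a plain
# TensorSpec([None], ...) per component -- no ragged tensors involved.
dataset = tf.data.Dataset.zip((
    tf.data.Dataset.from_generator(
        lambda: eng, output_signature=tf.TensorSpec([None], tf.int32)),
    tf.data.Dataset.from_generator(
        lambda: jap, output_signature=tf.TensorSpec([None], tf.int32)),
))

# Bucket on the English length; each emitted batch is padded with PAD_ID.
dataset = dataset.bucket_by_sequence_length(
    element_length_func=lambda e, j: tf.shape(e)[0],
    bucket_boundaries=[4],
    bucket_batch_sizes=[BATCH_SIZE, BATCH_SIZE],
    padding_values=PAD_ID,
)

for eng_batch, jap_batch in dataset:
    print(eng_batch.shape, jap_batch.shape)
```

Each batch then contains sequences of similar length, padded only up to the longest sequence in that batch rather than a global maximum.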
Upvotes: 1