Reputation: 69
I'm trying to construct a dataset of variable-length English/Japanese sentence pairs for a machine translation problem, but I can't get the `Dataset.bucket_by_sequence_length` function to work. It throws the following:
TypeError: Tensor.__init__() missing 3 required positional arguments: 'op',
'value_index', and 'dtype'
Despite my best efforts I have not been able to diagnose the problem. I've tried using a named function for `element_length_func`, passing a single dataset entry, passing only the English sentences, and trying various manual and dynamically generated values for `bucket_boundaries` and `bucket_batch_sizes`. The code for constructing the dataset from a list of lists of varying length containing integer indexes is included below. Any suggestions or possible solutions?
# Create initial dataset
eng, jap = map(list, zip(*data))
assert len(eng) == len(jap)
eng = tf.ragged.constant(eng, dtype=tf.uint16)
jap = tf.ragged.constant(jap, dtype=tf.uint16)
dataset = tf.data.Dataset.from_tensor_slices((eng, jap))
# Bucket based on sequence length
vocab = tokenizer.get_vocab()
dataset = dataset.bucket_by_sequence_length(
    element_length_func=lambda x, _=None: tf.shape(x)[0],
    bucket_boundaries=[100],
    bucket_batch_sizes=[BATCH_SIZE, BATCH_SIZE],
    padding_values=vocab["[PAD]"],
)
Upvotes: 1
Views: 153
Reputation: 871
It doesn't seem like `bucket_by_sequence_length()` (or, more precisely, `PaddedBatchDataset`) supports ragged tensor inputs.
(You can check that in your case `dataset.element_spec` consists of `tf.RaggedTensorSpec`, while `PaddedBatchDataset` wants `tf.TensorSpec`: link)
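You can reproduce the mismatch with a toy dataset (the token lists below are made-up stand-ins for your tokenized data):

```python
import tensorflow as tf

# Building a dataset from a ragged constant yields RaggedTensorSpec
# elements, which the padded-batching machinery cannot handle.
ragged = tf.ragged.constant([[1, 2, 3], [4]], dtype=tf.int32)
ds = tf.data.Dataset.from_tensor_slices(ragged)
print(ds.element_spec)  # a tf.RaggedTensorSpec, not a tf.TensorSpec
```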
Is it meaningful to pair input data like this instead:
eng, jap = ...  # some iterables (and *not* tensors)
dataset = tf.data.Dataset.zip((
    tf.data.Dataset.from_generator(lambda: eng, tf.uint16, output_shapes=[None]),
    tf.data.Dataset.from_generator(lambda: jap, tf.uint16, output_shapes=[None])))
and then apply `tf.data.experimental.bucket_by_sequence_length()` to the result?
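For reference, a minimal end-to-end sketch of that approach. The token lists, `PAD_ID`, bucket boundary, and batch size below are made-up stand-ins, and I'm using the `Dataset.bucket_by_sequence_length` method (available since TF 2.6) rather than the experimental function:

```python
import tensorflow as tf

# Dummy stand-ins for tokenized sentence pairs of varying length.
eng = [[2, 5, 9], [2, 5, 9, 11, 3], [2, 7]]
jap = [[4, 8], [4, 8, 12, 6], [4, 10, 13]]
PAD_ID = 0       # assumed [PAD] index
BATCH_SIZE = 2

# from_generator yields dense 1-D tensors, so element_spec is a plain
# TensorSpec([None], ...) per component -- no ragged tensors involved.
dataset = tf.data.Dataset.zip((
    tf.data.Dataset.from_generator(
        lambda: eng, output_signature=tf.TensorSpec([None], tf.int32)),
    tf.data.Dataset.from_generator(
        lambda: jap, output_signature=tf.TensorSpec([None], tf.int32)),
))

# Bucket on the English length; each emitted batch is padded with PAD_ID.
dataset = dataset.bucket_by_sequence_length(
    element_length_func=lambda e, j: tf.shape(e)[0],
    bucket_boundaries=[4],
    bucket_batch_sizes=[BATCH_SIZE, BATCH_SIZE],
    padding_values=PAD_ID,
)

for eng_batch, jap_batch in dataset:
    print(eng_batch.shape, jap_batch.shape)
```

Each batch then contains sequences of similar length, padded only up to the longest sequence in that batch rather than a global maximum.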
Upvotes: 1