Reputation: 1881
I have a long list of lists of integers (representing sentences, each one of different sizes) that I want to feed using the tf.data library. Each list (of the lists of list) has different length, and I get an error, which I can reproduce here:
t = [[4,2], [3,4,5]]
dataset = tf.data.Dataset.from_tensor_slices(t)
The error I get is:
ValueError: Argument must be a dense tensor: [[4, 2], [3, 4, 5]] - got shape [2], but wanted [2, 2].
Is there a way to do this?
EDIT 1: Just to be clear, I don't want to pad the input list of lists (it's a list of sentences containing over a million elements, with varying lengths) I want to use the tf.data library to feed, in a proper way, a list of lists with varying length.
Upvotes: 31
Views: 20659
Reputation: 438
For those working with TensorFlow 2 and looking for an answer I found the following to work directly with ragged tensors. which should be much faster than generator, as long as the entire dataset fits in memory.
t = [[[4,2]],
[[3,4,5]]]
rt=tf.ragged.constant(t)
dataset = tf.data.Dataset.from_tensor_slices(rt)
for x in dataset:
print(x)
produces
<tf.RaggedTensor [[4, 2]]> <tf.RaggedTensor [[3, 4, 5]]>
For some reason, it's very particular about having at least 2 dimensions on the individual arrays.
Upvotes: 10
Reputation: 5803
In addition to @mrry's answer, the following code is also possible if you would like to create (images, labels) pair:
import itertools
data = tf.data.Dataset.from_generator(lambda: itertools.izip_longest(images, labels),
output_types=(tf.float32, tf.float32),
output_shapes=(tf.TensorShape([None, None, 3]),
tf.TensorShape([None])))
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
with tf.Session() as sess:
image, label = sess.run(next_element) # ==> shape: [320, 420, 3], [20]
image, label = sess.run(next_element) # ==> shape: [1280, 720, 3], [40]
Upvotes: 1
Reputation: 126154
You can use tf.data.Dataset.from_generator()
to convert any iterable Python object (like a list of lists) into a Dataset
:
t = [[4, 2], [3, 4, 5]]
dataset = tf.data.Dataset.from_generator(lambda: t, tf.int32, output_shapes=[None])
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
with tf.Session() as sess:
print(sess.run(next_element)) # ==> '[4, 2]'
print(sess.run(next_element)) # ==> '[3, 4, 5]'
Upvotes: 27
Reputation: 1021
I don't think tensorflow supports tensors with varying numbers of elements along a given dimension.
However, a simple solution is to pad the nested lists with trailing zeros (where necessary):
t = [[4,2], [3,4,5]]
max_length = max(len(lst) for lst in t)
t_pad = [lst + [0] * (max_length - len(lst)) for lst in t]
print(t_pad)
dataset = tf.data.Dataset.from_tensor_slices(t_pad)
print(dataset)
Outputs:
[[4, 2, 0], [3, 4, 5]]
<TensorSliceDataset shapes: (3,), types: tf.int32>
The zeros shouldn't be a big problem for the model: semantically they're just extra sentences of size zero at the end of each list of actual sentences.
Upvotes: 0