Reputation: 29
this is my first question here. I've been trying to build a dataset from the popular IMDb dataset for learning purposes. The directories are as follows: .../train/pos/ and .../train/neg/ . I wrote a function that should merge the text files with their labels, but I'm getting an error. I need your help to debug!
def datasetcreate(filepath, label):
    filepaths = tf.data.Dataset.list_files(filepath)
    return tf.stack([tf.data.Dataset.from_tensor_slices((_, tf.constant(label, dtype='int32'))) for _ in tf.data.TextLineDataset(filepaths)])

datasetcreate(['aclImdb/train/pos/*.txt'], 1)
And this is the error I'm getting:
ValueError: Value tf.Tensor(b'An American in Paris was, in many ways, the ultimate.....dancers of all time.', shape=(), dtype=string) has insufficient rank for batching.
Why does this happen and what can I do to get rid of this? Thanks.
Upvotes: 3
Views: 4017
Reputation: 1084
Your code has two problems:
First, the way you load your TextLineDatasets, the loaded tensors contain string objects, which have an empty shape, i.e. a rank of zero. The rank of a tensor is the length of its shape property.
Second, you are trying to stack tensors of different rank, which would throw another error: a sentence (a sequence of tokens) has rank 1, while the label, as a scalar, has rank 0.
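You can see both ranks directly with a minimal sketch (the example sentence is arbitrary):

```python
import tensorflow as tf

# A raw line from a text file is a scalar string: shape (), rank 0.
sentence = tf.constant("An American in Paris was great")
print(sentence.shape.rank)  # 0

# Splitting it into tokens yields a rank-1 tensor.
tokens = tf.strings.split(sentence)
print(tokens.shape.rank)  # 1

# The label is also a scalar, rank 0 -- stacking it with a
# rank-1 token tensor is exactly the mismatch described above.
label = tf.constant(1, dtype="int32")
print(label.shape.rank)  # 0
```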
If you just need the dataset, I recommend using the TensorFlow Datasets package, which has many ready-to-use datasets available.
If you want to solve your particular problem, one way to fix your data pipeline is by using the Dataset.interleave and Dataset.zip functions.
import tensorflow as tf

# load positive sentences
filepaths = list(tf.data.Dataset.list_files('aclImdb/train/pos/*.txt'))
sentences_ds = tf.data.Dataset.from_tensor_slices(filepaths)
sentences_ds = sentences_ds.interleave(lambda text_file: tf.data.TextLineDataset(text_file))
sentences_ds = sentences_ds.map(lambda text: tf.strings.split(text))

# dataset for labels: create 1 label per file
labels = tf.constant(1, dtype="int32", shape=(len(filepaths),))
label_ds = tf.data.Dataset.from_tensor_slices(labels)

# combine text with label datasets
dataset = tf.data.Dataset.zip((sentences_ds, label_ds))
print(list(dataset.as_numpy_iterator()))
First, you use the interleave function to combine multiple text datasets into one. Next, you use tf.strings.split to split each text into its tokens. Then, you create a dataset for your positive labels. Finally, you combine the two datasets using zip.
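If you also need the negative reviews, the same pipeline can be wrapped in a small helper and the two labeled datasets concatenated — a sketch, assuming the directory layout from the question (the helper name labeled_dataset is mine, not a TensorFlow API):

```python
import tensorflow as tf

def labeled_dataset(pattern, label):
    # List the matching files and read each one line by line.
    filepaths = list(tf.data.Dataset.list_files(pattern))
    sentences_ds = tf.data.Dataset.from_tensor_slices(filepaths)
    sentences_ds = sentences_ds.interleave(tf.data.TextLineDataset)
    sentences_ds = sentences_ds.map(tf.strings.split)
    # One label per file, zipped with the tokenized sentences.
    labels = tf.constant(label, dtype="int32", shape=(len(filepaths),))
    label_ds = tf.data.Dataset.from_tensor_slices(labels)
    return tf.data.Dataset.zip((sentences_ds, label_ds))

# On the actual IMDb directories you would then call:
# train_ds = labeled_dataset('aclImdb/train/pos/*.txt', 1).concatenate(
#     labeled_dataset('aclImdb/train/neg/*.txt', 0))
```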
IMPORTANT: To train/run any DL model on your dataset, you will likely need further pre-processing of your sentences, e.g. building a vocabulary and training word embeddings.
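The vocabulary step can be sketched with Keras' TextVectorization layer, which operates on raw review strings (i.e. before the tf.strings.split step above); the two example reviews are made up:

```python
import tensorflow as tf

reviews = tf.constant(["a great movie", "a terrible movie"])

# Learn a vocabulary from the text, then map tokens to integer ids.
vectorizer = tf.keras.layers.TextVectorization(output_mode="int")
vectorizer.adapt(reviews)
ids = vectorizer(reviews)  # one row of token ids per review
```

The resulting integer ids can then feed an Embedding layer instead of pre-trained word embeddings.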
Upvotes: 3