Flatten Dataset of multiple files tensorflow

Question

I'm trying to read the CIFAR-10 dataset from 6 .bin files and then create an initializable_iterator. This is the site I downloaded the data from; it also contains a description of the structure of the binary files. Each file contains 2500 images. The resulting iterator, however, only generates one tensor per file, a tensor of shape (2500, 3073). Here is my code:

import tensorflow as tf

filename_dataset = tf.data.Dataset.list_files("cifar-10-batches-bin/*.bin")    
image_dataset = filename_dataset.map(lambda x: tf.decode_raw(tf.read_file(x), tf.float32))

iter_ = image_dataset.make_initializable_iterator()
next_file_data = iter_.get_next()

next_file_data = tf.reshape(next_file_data, [-1,3073])
next_file_img_data, next_file_labels = next_file_data[:,:-1], next_file_data[:,-1]
next_file_img_data = tf.reshape(next_file_img_data, [-1,32,32,3])

init_op = iter_.initializer

with tf.Session() as sess:
    sess.run(init_op)
    print(next_file_img_data.eval().shape) 

Output:

(2500, 32, 32, 3)

The first two lines are based on this answer. I would like to be able to specify the number of images generated by get_next() using batch(), rather than having it fixed at the number of images in each .bin file, which here is 2500.

There has already been a question about flattening a dataset here, but the answer is not clear to me. In particular, the question seems to contain a code snippet from a class function which is defined elsewhere, and I am not sure how to implement it.

I have also tried creating the dataset with tf.data.Dataset.from_tensor_slices(), replacing the first line above with

import os

filenames = [os.path.join('cifar-10-batches-bin',f) for f in os.listdir("cifar-10-batches-bin") if f.endswith('.bin')]
filename_dataset = tf.data.Dataset.from_tensor_slices(filenames)

but this didn't solve the problem.

Any help would be very much appreciated. Thanks.

kvish · Accepted Answer

I am not sure how your .bin files are structured. I am assuming each file holds 32*32*3 = 3072 values per image, so the amount of data in each file is a multiple of 3072. For any other structure the operations would be similar, so this can still serve as a guide. You could do a series of mapping operations:

import tensorflow as tf

batch_size = 32  # example value; choose whatever batch size you need

filename_dataset = tf.data.Dataset.list_files("cifar-10-batches-bin/*.bin")
image_dataset = filename_dataset.map(lambda x: tf.decode_raw(tf.read_file(x), tf.float32))
# Reshape each file's data to (2500, 32, 32, 3)
image_dataset = image_dataset.map(lambda x: tf.reshape(x, [-1, 32, 32, 3]))
# Split each file's tensor into individual (32, 32, 3) tensors and chain them into one dataset
image_dataset = image_dataset.flat_map(lambda x: tf.data.Dataset.from_tensor_slices(x))
# Now you can define your batch size
image_dataset = image_dataset.batch(batch_size)
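
To make the effect of flat_map followed by batch more concrete, here is a rough pure-Python analogy that needs no TensorFlow. It is only a sketch, not how tf.data is implemented. It assumes each record is 1 label byte followed by 32*32*3 = 3072 pixel bytes, which is my understanding of the CIFAR-10 binary layout but goes beyond what the answer above assumes. The helper names (records, batches, make_fake_file) and the in-memory fake files are hypothetical and exist only for illustration.

```python
from itertools import chain, islice

# Assumed record layout: 1 label byte + 32*32*3 pixel bytes.
RECORD_BYTES = 1 + 32 * 32 * 3


def records(file_bytes):
    """Split one file's raw bytes into (label, pixels) records."""
    if len(file_bytes) % RECORD_BYTES:
        raise ValueError("file size is not a multiple of the record size")
    for start in range(0, len(file_bytes), RECORD_BYTES):
        record = file_bytes[start:start + RECORD_BYTES]
        yield record[0], record[1:]


def batches(items, batch_size):
    """Group an iterable into lists of up to batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def make_fake_file(labels):
    """Build synthetic file contents: one record per label, all-zero pixels."""
    return b"".join(bytes([label]) + bytes(RECORD_BYTES - 1) for label in labels)


# Two hypothetical in-memory "files" holding 3 and 2 records.
fake_files = [make_fake_file([0, 1, 2]), make_fake_file([3, 4])]

# flat_map analogue: one stream of records, ignoring file boundaries.
all_records = chain.from_iterable(records(f) for f in fake_files)

# batch analogue: the batch size is now independent of the file size.
batch_labels = [[label for label, _ in batch] for batch in batches(all_records, 2)]
print(batch_labels)  # [[0, 1], [2, 3], [4]]
```

Note that the second batch mixes records from both files, which is the point of flattening before batching. If the real files do store one byte per value as assumed here, decoding them as tf.float32 would merge four bytes into each value, so tf.uint8 may be the dtype to use; that is worth checking against the dataset description.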
