Kumaresh Balaji

Reputation: 43

Long initialization time for model.fit when using a TensorFlow dataset created from a generator

This is my first question on Stack Overflow, so I apologise in advance for any poor formatting and indentation while I get used to the interface.

Environment specifications:

TensorFlow version - 2.7.0 with GPU support (tested and working)

Python version - 3.9.6

CPU - Intel Core i7 7700HQ

GPU - NVIDIA GTX 1060 3GB

RAM - 16GB DDR4 2400MHz

HDD - 1TB 5400 RPM

Problem Statement:

I wish to train a TensorFlow 2.7.0 model to perform multilabel classification with six classes on CT scans stored as DICOM images. The dataset is from Kaggle, link here. The training labels are stored in a CSV file, and the DICOM file names follow the format ID_<random characters>.dcm. The images have a combined size of 368 GB.

Approach used:

  1. The CSV file containing the labels is imported into a pandas DataFrame, and the image file names are set as its index (see the sketch after this list).

  2. A simple data generator reads each DICOM image and its labels by iterating over the rows of the DataFrame. This generator is used to create a training dataset with tf.data.Dataset.from_generator. The images are pre-processed using bsb_window() (a rough sketch of such a function is shown after the code below).

  3. The training dataset is shuffled and split into a training (90%) set and a validation (10%) set.

  4. The model is created using Keras Sequential, compiled, and fit using the training and validation datasets created earlier.
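
The DataFrame df referenced in the code below is not constructed in the post. A minimal sketch of step 1, assuming the labels CSV has already been reshaped so that each row holds one image ID and its six label columns (the file and column names here are assumptions, not the original code):

import pandas as pd

# Hypothetical file/column names: one row per image, six label columns.
df = pd.read_csv("train_labels.csv")
df = df.set_index("ID")  # image file name (without the ".dcm" extension)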

Code:

import numpy as np
import pydicom
import tensorflow as tf

# df, train_images_dir and bsb_window() are defined elsewhere (see above).
def train_generator():
    for row in df.itertuples():
        image = pydicom.dcmread(train_images_dir + row.Index + ".dcm")
        try:
            image = bsb_window(image)
        except:
            image = np.zeros((256, 256, 3))
        labels = row[1:]
        yield image, labels

train_images = tf.data.Dataset.from_generator(
    train_generator,
    output_signature=(
        tf.TensorSpec(shape=(256, 256, 3)),
        tf.TensorSpec(shape=(6,)),
    ),
)
train_images = train_images.batch(4)

TRAIN_NUM_FILES = 752803
train_images = train_images.shuffle(40)
val_size = int(TRAIN_NUM_FILES * 0.1)
val_images = train_images.take(val_size)
train_images = train_images.skip(val_size)

from tensorflow.keras import Sequential
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D

def create_model():
    model = Sequential([
        InceptionV3(include_top=False, input_shape=(256, 256, 3), weights="imagenet"),
        GlobalAveragePooling2D(name="avg_pool"),
        Dense(6, activation="sigmoid", name="dense_output"),
    ])
    model.compile(
        loss="binary_crossentropy",
        optimizer=tf.keras.optimizers.Adam(5e-4),
        metrics=["accuracy", tf.keras.metrics.SpecificityAtSensitivity(0.8)],
    )
    return model

model = create_model()
history = model.fit(train_images,
                    batch_size=4,
                    epochs=5,
                    verbose=1,
                    validation_data=val_images)
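
For context, bsb_window() is called above but not defined in the post; it presumably applies a few CT intensity windows to the DICOM slice and stacks them as three image channels. A rough sketch under that assumption (the window centres/widths are illustrative only, and resizing to 256x256 is omitted):

import numpy as np
import pydicom

def window_image(dcm, center, width):
    # Convert raw pixel values to Hounsfield units, then clip to the window.
    image = dcm.pixel_array * dcm.RescaleSlope + dcm.RescaleIntercept
    low, high = center - width / 2, center + width / 2
    return (np.clip(image, low, high) - low) / (high - low)

def bsb_window(dcm):
    # Assumed windows (brain / subdural / soft tissue), stacked as 3 channels.
    brain = window_image(dcm, 40, 80)
    subdural = window_image(dcm, 80, 200)
    soft = window_image(dcm, 40, 380)
    return np.dstack([brain, subdural, soft])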

Issue:

When executing this code, there is a delay of a few hours, with high disk usage (~30 MB/s reads), before training begins. When a data generator is built with tf.keras.utils.Sequence instead, training commences within seconds of calling model.fit().
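
For comparison, the tf.keras.utils.Sequence generator mentioned above is not shown in the post; a minimal sketch of what such a class could look like (the class name and constructor arguments are assumptions, not the original implementation):

import math
import numpy as np
import pydicom
import tensorflow as tf

class DicomSequence(tf.keras.utils.Sequence):
    # Illustrative only: mirrors the generator logic from the question.
    def __init__(self, dataframe, images_dir, batch_size=4):
        self.df = dataframe
        self.images_dir = images_dir
        self.batch_size = batch_size

    def __len__(self):
        return math.ceil(len(self.df) / self.batch_size)

    def __getitem__(self, idx):
        rows = self.df.iloc[idx * self.batch_size:(idx + 1) * self.batch_size]
        images, labels = [], []
        for row in rows.itertuples():
            dcm = pydicom.dcmread(self.images_dir + row.Index + ".dcm")
            try:
                images.append(bsb_window(dcm))
            except Exception:
                images.append(np.zeros((256, 256, 3)))
            labels.append(row[1:])
        return np.array(images), np.array(labels)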

Potential causes:

  1. Iterating over a pandas DataFrame in train_generator(). I am not sure how to avoid this (one possible alternative is sketched after this list).
  2. The use of external functions to pre-process and load the data.
  3. The usage of the take() and skip() methods to create training and validation datasets.
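
One way to avoid the Python-level DataFrame iteration mentioned in cause 1 (not part of the original post, just an illustration) is to materialise the file paths and labels once and build the dataset from those arrays, leaving image decoding to a parallel map step as in the accepted answer below:

import numpy as np
import tensorflow as tf

# Illustration only; assumes df's columns are exactly the six label columns.
paths = np.array([train_images_dir + name + ".dcm" for name in df.index])
labels = df.to_numpy(dtype="float32")  # shape (num_images, 6)

path_ds = tf.data.Dataset.from_tensor_slices((paths, labels))
# Image loading/pre-processing would then run in a parallel .map() call.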

How do I optimise this code to run faster? I've heard that splitting the data generator into separate label-creation and image pre-processing functions, and parallelising those operations, would improve performance, but I'm not sure how to apply those ideas in my case. Any advice would be highly appreciated.

Upvotes: 0

Views: 478

Answers (1)

Kumaresh Balaji

Reputation: 43

I FOUND THE ANSWER

The problem was in the following code:

TRAIN_NUM_FILES = 752803
train_images = train_images.shuffle(40)
val_size = int(TRAIN_NUM_FILES * 0.1)
val_images = train_images.take(val_size)
train_images = train_images.skip(val_size)

It takes an inordinate amount of time to split the dataset into training and validation sets after the images have been loaded, because take() and skip() have to pull every element they pass over through the generator, which means reading and pre-processing hundreds of thousands of DICOM files before the first training batch is produced. The split should therefore happen early in the pipeline, before any images are loaded. Hence, I separated image-path generation from the actual image loading, then parallelized the loading using the recommendations given here. The final optimized code is as follows:

def train_generator():
    for row in df.itertuples():
        image_path = f"{train_images_dir}{row.Index}.dcm"
        labels = np.reshape(row[1:], (1,6))
        yield image_path, labels

def test_generator():
    for row in test_df.itertuples():
        image_path = f"{test_images_dir}{row.Index}.dcm"
        labels = np.reshape(row[1:], (1,6))
        yield image_path, labels

def image_loading(image_path):
    image_path = tf.compat.as_str_any(tf.strings.reduce_join(image_path).numpy())
    dcm = pydicom.dcmread(image_path)
    try:
        image = bsb_window(dcm)
    except:
        image = np.zeros((256,256,3))
    return image

def wrap_img_load(image_path):
    # Run the Python-level pydicom loader inside the tf.data pipeline.
    return tf.numpy_function(image_loading, [image_path], [tf.double])

def set_shape(image, labels):
    # tf.numpy_function drops static shape information, so restore it here.
    image = tf.reshape(image, [256, 256, 3])
    labels = tf.reshape(labels, [1, 6])
    labels = tf.squeeze(labels)
    return image, labels

train_images = tf.data.Dataset.from_generator(
    train_generator,
    output_signature=(tf.TensorSpec(shape=(), dtype=tf.string), tf.TensorSpec(shape=(None, 6))),
).prefetch(tf.data.AUTOTUNE)
test_images = tf.data.Dataset.from_generator(
    test_generator,
    output_signature=(tf.TensorSpec(shape=(), dtype=tf.string), tf.TensorSpec(shape=(None, 6))),
).prefetch(tf.data.AUTOTUNE)

TRAIN_NUM_FILES = 752803
train_images = train_images.shuffle(40)
val_size = int(TRAIN_NUM_FILES * 0.1)
val_images = train_images.take(val_size)
train_images = train_images.skip(val_size)

train_images = train_images.map(lambda path, labels: (wrap_img_load(path), labels),
                                num_parallel_calls=tf.data.AUTOTUNE).prefetch(tf.data.AUTOTUNE)
test_images = test_images.map(lambda path, labels: (wrap_img_load(path), labels),
                              num_parallel_calls=tf.data.AUTOTUNE).prefetch(tf.data.AUTOTUNE)
val_images = val_images.map(lambda path, labels: (wrap_img_load(path), labels),
                            num_parallel_calls=tf.data.AUTOTUNE).prefetch(tf.data.AUTOTUNE)

train_images = train_images.map(set_shape, num_parallel_calls=tf.data.AUTOTUNE).prefetch(tf.data.AUTOTUNE)
test_images = test_images.map(set_shape, num_parallel_calls=tf.data.AUTOTUNE).prefetch(tf.data.AUTOTUNE)
val_images = val_images.map(set_shape, num_parallel_calls=tf.data.AUTOTUNE).prefetch(tf.data.AUTOTUNE)

train_images = train_images.batch(4).prefetch(tf.data.AUTOTUNE)
test_images = test_images.batch(4).prefetch(tf.data.AUTOTUNE)
val_images = val_images.batch(4).prefetch(tf.data.AUTOTUNE)

def create_model():
    model = Sequential([
        InceptionV3(include_top=False, input_shape=(256, 256, 3), weights="imagenet"),
        GlobalAveragePooling2D(name="avg_pool"),
        Dense(6, activation="sigmoid", name="dense_output"),
    ])
    model.compile(loss="binary_crossentropy",
                  optimizer=tf.keras.optimizers.Adam(5e-4),
                  metrics=["accuracy"])
    return model

model = create_model()

# checkpointer and scheduler are Keras callbacks defined elsewhere (not shown in the post).
history = model.fit(train_images,
                    epochs=5,
                    verbose=1,
                    callbacks=[checkpointer, scheduler],
                    validation_data=val_images)

The CPU, GPU, and HDD are now utilized very efficiently, and training is much faster than with a tf.keras.utils.Sequence data generator.

Upvotes: 1
