Reputation: 43
This is my first question on Stack Overflow. I apologise in advance for the poor formatting and indentation; I had trouble with the interface.
Environment specifications:
TensorFlow version - 2.7.0 GPU (tested and working properly)
Python version - 3.9.6
CPU - Intel Core i7 7700HQ
GPU - NVIDIA GTX 1060 3GB
RAM - 16GB DDR4 2400MHz
HDD - 1TB 5400 RPM
Problem Statement:
I wish to train a TensorFlow 2.7.0 model to perform multilabel classification with six classes on CT scans stored as DICOM images. The dataset is from Kaggle, link here. The training labels are stored in a CSV file, and the DICOM image names are of the format ID_"random characters".dcm. The images have a combined size of 368 GB.
Approach used:
The CSV file containing the labels is imported into a pandas DataFrame and the image filenames are set as the index.
A simple data generator is created to read the DICOM image and the labels by iterating on the rows of the DataFrame. This generator is used to create a training dataset using tf.data.Dataset.from_generator. The images are pre-processed using bsb_window().
The training dataset is shuffled and split into a training set (90%) and a validation set (10%).
The model is created using Keras Sequential, compiled, and fit using the training and validation datasets created earlier.
Code:
def train_generator():
    for row in df.itertuples():
        image = pydicom.dcmread(train_images_dir + row.Index + ".dcm")
        try:
            image = bsb_window(image)
        except:
            image = np.zeros((256,256,3))
        labels = row[1:]
        yield image, labels
train_images = tf.data.Dataset.from_generator(
    train_generator,
    output_signature = (
        tf.TensorSpec(shape = (256,256,3)),
        tf.TensorSpec(shape = (6,))
    )
)
train_images = train_images.batch(4)
TRAIN_NUM_FILES = 752803
train_images = train_images.shuffle(40)
val_size = int(TRAIN_NUM_FILES * 0.1)
val_images = train_images.take(val_size)
train_images = train_images.skip(val_size)
def create_model():
    model = Sequential([
        InceptionV3(include_top = False, input_shape = (256,256,3), weights = "imagenet"),
        GlobalAveragePooling2D(name = "avg_pool"),
        Dense(6, activation = "sigmoid", name = "dense_output"),
    ])
    model.compile(loss = "binary_crossentropy",
                  optimizer = tf.keras.optimizers.Adam(5e-4),
                  metrics = ["accuracy", tf.keras.metrics.SpecificityAtSensitivity(0.8)]
                  )
    return model
model = create_model()
history = model.fit(train_images,
                    batch_size=4,
                    epochs=5,
                    verbose=1,
                    validation_data=val_images
                    )
Issue:
When executing this code, there is a delay of a few hours of high disk usage (~30MB/s reads) before training begins. When a DataGenerator is made using tf.keras.utils.Sequence, training commences within seconds of calling model.fit().
Potential causes:
How do I optimise this code to run faster? I've heard splitting the data generator into label creation, image pre-processing functions and parallelising operations would improve performance. Still, I'm not sure how to apply those concepts in my case. Any advice would be highly appreciated.
Upvotes: 0
Views: 478
Reputation: 43
I FOUND THE ANSWER
The problem was in the following code:
TRAIN_NUM_FILES = 752803
train_images = train_images.shuffle(40)
val_size = int(TRAIN_NUM_FILES * 0.1)
val_images = train_images.take(val_size)
train_images = train_images.skip(val_size)
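Why these few lines are so expensive can be seen in a plain-Python analogy. It uses only generators and `itertools.islice`, not tf.data itself, and `load_batch` is a hypothetical stand-in for reading and windowing one batch of DICOM files. A lazy stream cannot jump ahead, so skipping n elements means producing n elements and throwing them away:

```python
from itertools import islice

loads = 0  # counts how many batches were actually "loaded"

def load_batch(i):
    """Hypothetical stand-in for reading and windowing one batch of DICOM files."""
    global loads
    loads += 1
    return f"batch-{i}"

def batches(n):
    # Lazy stream: a batch is only loaded when the consumer asks for it.
    for i in range(n):
        yield load_batch(i)

def skip(stream, n):
    # Same idea as Dataset.skip(n): drop the first n elements of the stream.
    return islice(stream, n, None)

total, val_size = 1000, 100
train_stream = skip(batches(total), val_size)
first_train_batch = next(train_stream)

print(first_train_batch)  # batch-100
print(loads)              # 101
```

All 100 discarded batches were loaded in full before the first training batch appeared. In the snippet quoted above the same thing happens with real images.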
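Splitting the dataset into training and validation sets after the images were loaded is what took so long. skip(val_size) cannot jump ahead in a generator-backed dataset: it has to produce and discard the first val_size elements, and because the dataset was already batched, each element was a batch of four decoded images. With val_size = 75280, that is about 300,000 DICOM files read and windowed before the first training step. The split should therefore be done early in the process, before loading any images. Hence, I separated the image path loading from the actual image loading, then parallelized the functions using the recommendations given here. The final optimized code is as follows: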
def train_generator():
    for row in df.itertuples():
        image_path = f"{train_images_dir}{row.Index}.dcm"
        labels = np.reshape(row[1:], (1,6))
        yield image_path, labels
def test_generator():
    for row in test_df.itertuples():
        image_path = f"{test_images_dir}{row.Index}.dcm"
        labels = np.reshape(row[1:], (1,6))
        yield image_path, labels
def image_loading(image_path):
    # Convert the incoming path to a Python str for pydicom
    image_path = tf.compat.as_str_any(tf.strings.reduce_join(image_path).numpy())
    dcm = pydicom.dcmread(image_path)
    try:
        image = bsb_window(dcm)
    except:
        # Fall back to a blank image if windowing fails
        image = np.zeros((256,256,3))
    return image
def wrap_img_load(image_path):
    # Run the Python loader as an op inside the tf.data pipeline
    return tf.numpy_function(image_loading, [image_path], [tf.double])
def set_shape(image, labels):
    # tf.numpy_function outputs have no static shape, so restore it here
    image = tf.reshape(image,[256,256,3])
    labels = tf.reshape(labels,[1,6])
    labels = tf.squeeze(labels)
    return image, labels
train_images = tf.data.Dataset.from_generator(train_generator, output_signature = (tf.TensorSpec(shape=(), dtype=tf.string), tf.TensorSpec(shape=(None,6)))).prefetch(tf.data.AUTOTUNE)
test_images = tf.data.Dataset.from_generator(test_generator, output_signature = (tf.TensorSpec(shape=(), dtype=tf.string), tf.TensorSpec(shape=(None,6)))).prefetch(tf.data.AUTOTUNE)
TRAIN_NUM_FILES = 752803
train_images = train_images.shuffle(40)
val_size = int(TRAIN_NUM_FILES * 0.1)
val_images = train_images.take(val_size)
train_images = train_images.skip(val_size)
train_images = train_images.map(lambda image_path, labels: (wrap_img_load(image_path),labels), num_parallel_calls = tf.data.AUTOTUNE).prefetch(tf.data.AUTOTUNE)
test_images = test_images.map(lambda image_path, labels: (wrap_img_load(image_path),labels), num_parallel_calls = tf.data.AUTOTUNE).prefetch(tf.data.AUTOTUNE)
val_images = val_images.map(lambda image_path, labels: (wrap_img_load(image_path),labels), num_parallel_calls = tf.data.AUTOTUNE).prefetch(tf.data.AUTOTUNE)
train_images = train_images.map(lambda image, labels: set_shape(image,labels), num_parallel_calls = tf.data.AUTOTUNE).prefetch(tf.data.AUTOTUNE)
test_images = test_images.map(lambda image, labels: set_shape(image,labels), num_parallel_calls = tf.data.AUTOTUNE).prefetch(tf.data.AUTOTUNE)
val_images = val_images.map(lambda image, labels: set_shape(image,labels), num_parallel_calls = tf.data.AUTOTUNE).prefetch(tf.data.AUTOTUNE)
train_images = train_images.batch(4).prefetch(tf.data.AUTOTUNE)
test_images = test_images.batch(4).prefetch(tf.data.AUTOTUNE)
val_images = val_images.batch(4).prefetch(tf.data.AUTOTUNE)
def create_model():
    model = Sequential([
        InceptionV3(include_top = False, input_shape = (256,256,3), weights='imagenet'),
        GlobalAveragePooling2D(name='avg_pool'),
        Dense(6, activation="sigmoid", name='dense_output'),
    ])
    model.compile(loss="binary_crossentropy", optimizer=tf.keras.optimizers.Adam(5e-4), metrics=["accuracy"])
    return model
model = create_model()
# checkpointer and scheduler are callbacks defined elsewhere (not shown here)
history = model.fit(train_images,
                    epochs=5,
                    verbose=1,
                    callbacks=[checkpointer, scheduler],
                    validation_data=val_images
                    )
The CPU, GPU, and HDD are utilized very efficiently, and training is much faster than with a tf.keras.utils.Sequence data generator.
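A possible further simplification, as a sketch and not part of the code tested above: the split can be made on the list of image IDs before any dataset is built, so that take() and skip() are not needed at all and the validation set stays fixed from epoch to epoch. The helper name `split_ids` and the sample IDs are made up for illustration; with the DataFrame from the question the IDs would come from `df.index`.

```python
import random

def split_ids(image_ids, val_fraction=0.1, seed=42):
    """Shuffle the IDs once with a fixed seed and cut them into (train, val)."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    val_size = int(len(ids) * val_fraction)
    return ids[val_size:], ids[:val_size]

# Made-up IDs standing in for df.index.
image_ids = [f"ID_{i:06d}" for i in range(20)]
train_ids, val_ids = split_ids(image_ids)

print(len(train_ids), len(val_ids))   # 18 2
print(set(train_ids) & set(val_ids))  # set()
```

Each ID list could then feed its own generator (for example by iterating over `df.loc[train_ids]` and `df.loc[val_ids]`), followed by the same map, set_shape, and batch steps as above.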
Upvotes: 1