JolesS

Reputation: 59

Speed up datasets loading on Google Colab

I am working on image classification with the German Traffic Sign dataset on Google Colab, using PyTorch. Here is the structure of the dataset:

I have managed to upload the whole dataset to my Drive (it took a long time!!!). I used the ImageFolder class and a Dataset class to load the training and test images, respectively.

However, training my model is really slow and the GPU is not used efficiently. After a lot of searching, I discovered that the file transfer from Drive to Colab is at fault here.

Does anyone know how I can use an HDF5 dataset (or another technique) to store all the training and test images up front for later preprocessing?

Upvotes: 2

Views: 9092

Answers (2)

maccaroo

Reputation: 837

The following code will copy a folder from your Google Drive to the Colab VM. (You will need to authorise the Drive share, as usual.) This improves model training time significantly over using the Drive mount during training.

I believe the copying time can be further improved by copying a zipped archive and then unzipping it at the destination; a sketch of that variant follows the code below.

import os
import shutil

from google.colab import drive
drive.mount('/content/drive')

def copy_files_recursive(source_folder, destination_folder):
    for root, dirs, files in os.walk(source_folder):
        for file in files:
            source_path = os.path.join(root, file)
            destination_path = os.path.join(destination_folder, os.path.relpath(source_path, source_folder))
            
            # Create destination directories if they don't exist
            os.makedirs(os.path.dirname(destination_path), exist_ok=True)
            
            shutil.copyfile(source_path, destination_path)

source_folder = '/content/drive/My Drive/xxx_folder'
destination_folder = '/content/xxx_folder'
copy_files_recursive(source_folder, destination_folder)
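
A minimal sketch of that zipped variant, assuming you first archive the folder on Drive as xxx_folder.zip (a hypothetical name): one large transfer avoids the per-file overhead of the Drive mount, and zipfile handles the extraction locally.

import shutil
import zipfile

# Copy a single zip archive from the mounted Drive to the VM's local disk.
zip_source = '/content/drive/My Drive/xxx_folder.zip'   # assumes this archive already exists on Drive
zip_destination = '/content/xxx_folder.zip'
shutil.copyfile(zip_source, zip_destination)

# Extract the archive into local storage on the Colab VM.
with zipfile.ZipFile(zip_destination) as archive:
    archive.extractall('/content/xxx_folder')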

Upvotes: 0

Clay Coleman

Reputation: 371

If your problem truly is the network speed between Colab and Drive, you should try uploading the files directly to the Google Colab instance, rather than accessing them from Drive.

from google.colab import files
dataset_file_dict = files.upload()

Doing this will save the files directly to your Colab instance, allowing your code to access the files locally.
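
For an image dataset that would typically mean uploading a single zip archive and extracting it on the instance. A rough sketch, assuming the archive is named gtsrb.zip and contains class subfolders (both hypothetical):

from google.colab import files
import zipfile

# files.upload() writes the selected files into the current working directory
# and returns a dict mapping each filename to its contents.
uploaded = files.upload()  # select gtsrb.zip in the upload dialog

# Extract the archive into local storage so PyTorch reads from the VM's disk.
with zipfile.ZipFile('gtsrb.zip') as archive:
    archive.extractall('/content/gtsrb')

# The local copy can then be loaded with ImageFolder as usual, e.g.:
# from torchvision import datasets, transforms
# train_set = datasets.ImageFolder('/content/gtsrb/train', transform=transforms.ToTensor())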

However, I'd suspect that there might be other problems besides the network latency – perhaps your model has lots of parameters, or somehow there was a bug in the code to get CUDA going. Sometimes I would forget to change my runtime to a GPU runtime under the "Runtime" menu tab, "Change Runtime Type".
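
A quick way to check that side of things with standard PyTorch calls:

import torch

# On a GPU runtime this prints True; on a CPU runtime it prints False.
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # name of the Colab-assigned GPU

# The model and every batch also have to be moved to the device explicitly,
# e.g. (assuming model, images and labels are defined in your training code):
# device = torch.device('cuda')
# model = model.to(device)
# images, labels = images.to(device), labels.to(device)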

Hope this helps!

Upvotes: 4
