Reputation: 59
I am working on image classification on the German Traffic Sign Dataset on Google Colab with PyTorch. Here is the structure of the dataset:
I have managed to upload the whole dataset to my Drive (it took a long time!). I used the ImageFolder class to load the training images and a custom Dataset class to load the test images.
However, training my model is really slow, and the GPU is not used efficiently. After a lot of searching, I discovered that file transfer from Drive to Colab is at fault here.
Does anyone know how I can use an HDF5 dataset (or another technique) to first store all training and test images for later preprocessing?
Upvotes: 2
Views: 9092
Reputation: 837
The following code will copy a folder from your Google Drive to the Colab VM. (You will need to authorise the Drive share, as usual.) This improves model training time significantly over using the Drive mount during training.
I believe the copying time can be further improved by copying zipped files and then unzipping them at the destination - I haven't added that here.
import os
import shutil
from google.colab import drive

drive.mount('/content/drive')

def copy_files_recursive(source_folder, destination_folder):
    for root, dirs, files in os.walk(source_folder):
        for file in files:
            source_path = os.path.join(root, file)
            destination_path = os.path.join(destination_folder, os.path.relpath(source_path, source_folder))
            # Create destination directories if they don't exist
            os.makedirs(os.path.dirname(destination_path), exist_ok=True)
            shutil.copyfile(source_path, destination_path)

source_folder = '/content/drive/My Drive/xxx_folder'
destination_folder = '/content/xxx_folder'
copy_files_recursive(source_folder, destination_folder)
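The zipped variant mentioned above could look roughly like the sketch below. It is untested against a real Drive mount and assumes the dataset has already been zipped into a single archive on Drive; `copy_and_extract` is a hypothetical helper name, not part of any Colab API.

```python
import shutil
import zipfile
from pathlib import Path


def copy_and_extract(zip_on_drive, local_dir):
    """Copy one zip archive to local disk, then extract it there.

    Assumes `zip_on_drive` points at an existing .zip file (for example on
    a mounted Drive) and that `local_dir` is on the VM's local disk.
    """
    zip_on_drive = Path(zip_on_drive)
    local_dir = Path(local_dir)
    local_dir.mkdir(parents=True, exist_ok=True)

    # One large sequential copy instead of many small per-file reads.
    local_zip = local_dir / zip_on_drive.name
    shutil.copyfile(zip_on_drive, local_zip)

    # Extraction happens entirely on the local disk.
    with zipfile.ZipFile(local_zip) as archive:
        archive.extractall(local_dir)

    # Remove the local copy of the archive to free disk space.
    local_zip.unlink()
    return local_dir
```

After mounting Drive, a call might be `copy_and_extract('/content/drive/My Drive/xxx_folder.zip', '/content/xxx_folder')`, where the archive name is only a placeholder. Whether the extracted files end up directly in `local_dir` or in a subfolder depends on how the archive was created.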
Upvotes: 0
Reputation: 371
If your problem truly is the network speed between Colab and Drive, you should try uploading the files directly to the Google Colab instance, rather than accessing them from Drive.
from google.colab import files
dataset_file_dict = files.upload()
Doing this will save the files directly to your Colab instance, allowing your code to access the files locally.
However, I suspect there may be other problems besides network latency: perhaps your model has a lot of parameters, or a bug in your code prevents CUDA from being used. I sometimes forget to switch to a GPU runtime under the "Runtime" menu, "Change runtime type".
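One way to tell whether data loading really is the bottleneck is to time batch fetching on its own, with no model involved. This is only a minimal sketch: `time_batches` is a hypothetical helper, and `loader` is assumed to be any iterable, such as a PyTorch DataLoader.

```python
import time


def time_batches(loader, max_batches=20):
    """Return the average number of seconds spent fetching one batch.

    Only iteration is timed; nothing is moved to the GPU and no model runs.
    """
    durations = []
    iterator = iter(loader)
    for _ in range(max_batches):
        start = time.perf_counter()
        try:
            next(iterator)
        except StopIteration:
            break  # the loader had fewer than max_batches batches
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations) if durations else 0.0
```

If the average is much higher when the dataset is read from the Drive mount than from a local copy, the file transfer is the likely culprit. If both are fast and training is still slow, the problem is probably elsewhere.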
Hope this helps!
Upvotes: 4