Xavier

Reputation: 61

Deep learning on Google Colab: loading a large image dataset is very slow, how to accelerate the process?

I'm working on a deep learning model using Keras, and to speed up the computation I'd like to use the GPU available on Google Colab.

My image files are already uploaded to my Google Drive. I have 24,000 images for training and 4,000 for testing my model.

However, when I load my images into an array, it takes a very long time (almost 2 h), so it is not convenient to do that every time I use a Google Colab notebook.

Would you know how to accelerate the process? This is my current code:

import os

import cv2
import numpy as np
from tqdm import tqdm

TRAIN_DIR = "Training_set/X"
TRAIN_DIR_Y = "Training_set/Y"
IMG_SIZE = 128

def parse_img_data(path):
    X_train = []
    img_ind = []
    for img_name in tqdm(os.listdir(path)):
        # File names are "<index>.<ext>", so recover the 0-based index
        img_ind.append(int(img_name.split('.')[0]) - 1)
        img_path = os.path.join(path, img_name)
        img = cv2.imread(img_path, cv2.IMREAD_COLOR)
        img = cv2.resize(img, (IMG_SIZE, IMG_SIZE))
        X_train.append(img)
    return np.array(img_ind), np.array(X_train)

ind_train, X_train = parse_img_data(TRAIN_DIR)

I'd be very grateful if you would help me.

Xavier

Upvotes: 6

Views: 6209

Answers (4)

maki

Reputation: 21

from numpy import savez_compressed

trainX, trainy = parse_img_data('/content/drive/My Drive/Training_set/')
savez_compressed('dataset.npz', trainX, trainy)

The first time, you load and save the data as above; after that, you can reuse it over and over again:

import numpy as np

data = np.load('/content/drive/My Drive/dataset.npz')
trainX, trainy = data['arr_0'], data['arr_1']

Upvotes: 1

Cacey

Reputation: 51

Not sure if you have solved the issue. I was having the same problem. After I called os.listdir on the data folder before running the CNN, it worked:

print(os.listdir("./drive/My Drive/Colab Notebooks/dataset"))

Upvotes: 5

brandata

Reputation: 81

I have been trying, and for those curious, it has not been possible for me to use flow_from_directory with a folder inside Google Drive. The Colab file environment does not read the path and gives a "Folder does not exist" error. I have been trying to solve the problem and searching Stack Overflow; similar questions have been posted here (Google collaborative) and here (Google Colab can't access drive content), with no effective solution and, for some reason, many downvotes to those who ask.

The only solution I found for reading 20k images in Google Colab was uploading them and then processing them, wasting two sad hours to do so. It makes sense: Google identifies things inside Drive with IDs, while flow_from_directory requires both the dataset and the classes to be identified by absolute folder paths, which is not compatible with Google Drive's identification method. An alternative might be using a Google Cloud environment instead, I suppose, and paying. We are getting quite a lot for free as it is. This is my novice understanding of the situation; please correct me if I'm wrong.

Edit 1: I was able to use flow_from_directory on Google Colab; Google does identify things by path as well. The catch is that os.getcwd() does not work properly: it will tell you that the current working directory is "/content", when in truth it is "/content/drive/My Drive/foldersinsideyourdrive/...../folderthathasyourcollabnotebook/". If you change the path in the train generator so that it includes this full prefix, and ignore os.getcwd(), it works (see the sketch after the mount snippet below). However, I had problems with RAM even when using flow_from_directory and could not train my CNN anyway; that might be something that just happens to me, though.

Make sure to execute

from google.colab import drive
drive.mount('/content/drive/')

so that the notebook recognizes the Drive paths.
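
For illustration, here is a minimal sketch of the point above: build the absolute path under "/content/drive/My Drive" yourself instead of relying on os.getcwd(). The folder name my_dataset is a hypothetical placeholder, not something from the original post:

import os

# Assumption: "my_dataset" is a hypothetical folder in your Drive, organised as
# my_dataset/<class_name>/<image files>, which is what flow_from_directory expects.
DATA_DIR = "/content/drive/My Drive/my_dataset"

# Sanity check that the mounted path exists before handing it to a generator.
print(os.path.isdir(DATA_DIR))
print(os.listdir(DATA_DIR)[:5])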

Upvotes: 0

monatis

Reputation: 584

You can try to mount your Google Drive folder (you can find the code snippet in the Examples menu) and use ImageDataGenerator with flow_from_directory(). Check the documentation here.
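
As a rough illustration of that suggestion (not code from the answer), here is a minimal sketch assuming the images are organised in class subfolders under a hypothetical "/content/drive/My Drive/Training_set" directory:

from google.colab import drive
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Mount Drive so its files are visible under /content/drive.
drive.mount('/content/drive')

# Assumption: Training_set/<class_name>/<image files>; the folder name is a
# placeholder borrowed from the question, so adjust it to your own layout.
datagen = ImageDataGenerator(rescale=1.0 / 255)

train_gen = datagen.flow_from_directory(
    "/content/drive/My Drive/Training_set",
    target_size=(128, 128),  # matches IMG_SIZE in the question
    batch_size=32,
    class_mode="categorical",
)

# model.fit(train_gen, epochs=10)  # hypothetical model, shown only for context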

Upvotes: 0
