Reputation: 17730
I have a list of images and tags like these (presently in a pandas DataFrame):
tag_cat tag_dog tag_house tag_person url
--------------------------------------------------------------------------
True True False False http://example.com/...JPG
False False False True http://example.com/...JPG
which means that the first image contains a cat and a dog. The images are not preprocessed.
How should I proceed? Should I download all the images, preprocess them, and store them locally? I would prefer to avoid that; I would rather download some images, preprocess them, and feed them to the optimization on the fly. Alternatively, a hybrid approach with a disk cache would work: download a few images, preprocess them, feed them to the optimization, and additionally save each image to disk so that on a rerun I don't need to re-download it.
Is there something that can help me with this?
Upvotes: 1
Views: 1446
Reputation: 5945
When you train a machine learning model, you usually train it for several cycles (epochs) over the data. In other words, you have to show all the data to your algorithm several times (tens to hundreds of times). From that perspective, downloading the images over and over is inefficient.
Another important point is that models operating on raw image pixels usually require a lot of resources, and in order to avoid bottlenecks and fully exploit your computational resources, you want to feed the data to your machine as fast as you can. Downloading images for every batch, again, sounds very inefficient.
Although I think it is inefficient, if you would still like to fetch the images from the web during training, you could write a custom Python generator that fetches the images from URLs, and then train the model in Keras with the fit_generator() method, which
Fits the model on data generated batch-by-batch by a Python generator.
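A minimal sketch of such a generator (the column names, image size, and the `load_image` helper are assumptions for illustration; a real loader would use something like `requests` + `PIL`):

```python
import numpy as np

def url_batch_generator(urls, labels, batch_size, load_image):
    """Yield (images, labels) batches forever, fetching each image on demand.

    `load_image` is any callable that takes a URL and returns an
    (H, W, C) float array. A real implementation might look like:
        from PIL import Image
        import requests, io
        def load_image(url):
            img = Image.open(io.BytesIO(requests.get(url).content))
            return np.asarray(img.convert("RGB").resize((224, 224))) / 255.0
    """
    n = len(urls)
    while True:  # Keras generators are expected to loop indefinitely
        for start in range(0, n, batch_size):
            batch_urls = urls[start:start + batch_size]
            batch_x = np.stack([load_image(u) for u in batch_urls])
            batch_y = np.asarray(labels[start:start + batch_size])
            yield batch_x, batch_y

# Hypothetical usage with your DataFrame `df` and label matrix `y`:
# model.fit_generator(
#     url_batch_generator(df['url'].tolist(), y, 32, load_image),
#     steps_per_epoch=len(df) // 32, epochs=10)
```

Note that every epoch re-downloads every image, which is exactly the inefficiency described above.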
Another alternative I can suggest is to extract the image features once (with an already trained CNN), save them locally on your filesystem, and train a simpler model on top of them. Such features usually have a very low space footprint (e.g. a 2048-element float32 array per image), so you could even store them within your pandas DataFrame. Look here under 'Extract features with VGG16' for an example of how to extract image features.
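A sketch of this idea (the Keras usage in the comment is an assumption; a headless VGG16 with average pooling yields a 512-dim vector per image, while e.g. ResNet50 yields 2048):

```python
import numpy as np

def extract_features(model, images):
    """Run a pretrained CNN over a batch of preprocessed images and
    return one fixed-length float32 vector per image."""
    feats = model.predict(images)
    # Flatten any remaining spatial dims so each image maps to a 1-D vector.
    return feats.reshape(len(images), -1).astype(np.float32)

# Hypothetical Keras usage (assumes keras.applications is available):
#   from keras.applications.vgg16 import VGG16, preprocess_input
#   cnn = VGG16(weights='imagenet', include_top=False, pooling='avg')
#   X = extract_features(cnn, preprocess_input(batch))   # shape (n, 512)
#   np.save('features.npy', X)  # train a simpler model on X later
```

The vectors are small enough to keep alongside your tag columns, so training the downstream model needs no image downloads at all.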
Regarding the hybrid caching approach, it might be doable, but I am not sure the machine-learning community is the right place to ask about it. In any case, machine learning has enough complexities of its own, and it might be better to focus your efforts on the algorithms and models rather than on a clever cacheable software pipeline.
Upvotes: 3