Davide Fiocco

Reputation: 5924

How to overcome TrainingException when training a large model with Azure Machine Learning service?

I'm training a largish model, and for that purpose I'm trying to use the Azure Machine Learning service in Azure Notebooks.

I thus create an Estimator to train locally:

from azureml.train.estimator import Estimator

estimator = Estimator(source_directory='./source_dir',
                      compute_target='local',
                      entry_script='train.py')

(my train.py is supposed to load a large word vector file and train from it).

When running with

run = experiment.submit(config=estimator)

I get

TrainingException:

====================================================================

While attempting to take snapshot of /data/home/username/notebooks/source_dir Your total snapshot size exceeds the limit of 300.0 MB. Please see http://aka.ms/aml-largefiles on how to work with large files.

====================================================================

The link provided in the error seems to be broken. The contents of my ./source_dir do indeed exceed 300 MB.
How can I solve this?

Upvotes: 2

Views: 1697

Answers (2)

Vlad Iliescu

Reputation: 8221

You can place the training files outside source_dir so that they aren't uploaded as part of submitting the experiment, and then upload them separately to the data store (which basically uses the Azure storage associated with your workspace). All that's left then is to reference the training files from train.py.
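A minimal sketch of that flow, assuming the v1 azureml-core SDK and a hypothetical ./big_files folder (kept outside ./source_dir) holding the word vectors:

from azureml.core import Workspace
from azureml.train.estimator import Estimator

ws = Workspace.from_config()

# Upload the large files to the workspace's default datastore,
# backed by the Azure storage account associated with the workspace.
ds = ws.get_default_datastore()
ds.upload(src_dir='./big_files',       # hypothetical folder outside ./source_dir
          target_path='word_vectors',  # destination path on the datastore
          overwrite=True,
          show_progress=True)

# Only ./source_dir gets snapshotted; the data is handed to the
# entry script as a datastore reference instead.
estimator = Estimator(source_directory='./source_dir',
                      compute_target='local',
                      entry_script='train.py',
                      script_params={'--data-folder': ds.as_mount()})

(On a local target, ds.as_download() may be the safer choice, since mounting is only supported on some compute targets.)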

See the Train model tutorial for an example of how to upload data to the data store and then access it from the training file.
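On the train.py side, here's a sketch of picking up the hypothetical --data-folder parameter from above:

import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, dest='data_folder',
                    help='datastore path injected by the Estimator')
args = parser.parse_args()

# The large word vector file now lives on the datastore, under the
# target_path used during upload, rather than in the snapshot.
vectors_path = os.path.join(args.data_folder, 'word_vectors')
print('loading word vectors from', vectors_path)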

Upvotes: 4

Peter Pan

Reputation: 24148

After reading the GitHub issue Encounter |total Snapshot size 300MB while start logging and the official document Manage and request quotas for Azure resources for the Azure ML service, I think this is an open issue that Azure will need some time to fix.

In the meantime, I'd recommend migrating your current work to another service, Azure Databricks: upload your dataset and code there and run them in an Azure Databricks notebook, which executes on a managed Spark cluster, without worrying about memory or storage limits. You can refer to these samples for Azure ML on Azure Databricks.
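For completeness, a minimal sketch of persisting a large file from a Databricks notebook, assuming a hypothetical word_vectors.bin already present on the driver's local disk:

# Runs in an Azure Databricks notebook cell; dbutils is provided by the runtime.
# Copy the hypothetical local file into DBFS so it survives cluster restarts.
dbutils.fs.cp('file:/tmp/word_vectors.bin', 'dbfs:/data/word_vectors.bin')

# List the destination to verify the copy.
display(dbutils.fs.ls('dbfs:/data/'))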

Upvotes: 0
