Davide Fiocco

Reputation: 5924

How to overcome TrainingException when training a large model with Azure Machine Learning service?

I'm training a largish model, and for that purpose I'm trying to use the Azure Machine Learning service in Azure Notebooks.

I thus create an Estimator to train locally:

from azureml.train.estimator import Estimator

estimator = Estimator(source_directory='./source_dir',
                      compute_target='local',
                      entry_script='train.py')

(my train.py is supposed to load a large word vector file and train from it).

When running with

run = experiment.submit(config=estimator)

I get

TrainingException:

====================================================================

While attempting to take snapshot of /data/home/username/notebooks/source_dir Your total snapshot size exceeds the limit of 300.0 MB. Please see http://aka.ms/aml-largefiles on how to work with large files.

====================================================================

The link provided in the error seems to be broken. The contents of my ./source_dir do indeed exceed 300 MB.
How can I solve this?

Upvotes: 2

Views: 1697

Answers (2)

Vlad Iliescu

Reputation: 8221

You can place the training files outside source_dir so that they aren't uploaded as part of submitting the experiment, and then upload them separately to the data store (which basically uses the Azure storage associated with your workspace). All that's left then is to reference the training files from train.py.
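A minimal sketch of that flow, assuming the v1 azureml-core SDK and a hypothetical ./big_files folder (kept outside ./source_dir) holding the word vectors:

from azureml.core import Workspace
from azureml.train.estimator import Estimator

ws = Workspace.from_config()

# Upload the large files to the workspace's default datastore,
# backed by the Azure storage account associated with the workspace.
ds = ws.get_default_datastore()
ds.upload(src_dir='./big_files',       # hypothetical folder outside ./source_dir
          target_path='word_vectors',  # destination path on the datastore
          overwrite=True,
          show_progress=True)

# Only ./source_dir gets snapshotted; the data is handed to the
# entry script as a datastore reference instead.
estimator = Estimator(source_directory='./source_dir',
                      compute_target='local',
                      entry_script='train.py',
                      script_params={'--data-folder': ds.as_mount()})

(On a local target, ds.as_download() may be the safer choice, since mounting is only supported on some compute targets.)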

See the Train model tutorial for an example of how to upload data to the data store and then access it from the training file.
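On the train.py side, here's a sketch of picking up the hypothetical --data-folder parameter from above:

import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, dest='data_folder',
                    help='datastore path injected by the Estimator')
args = parser.parse_args()

# The large word vector file now lives on the datastore, under the
# target_path used during upload, rather than in the snapshot.
vectors_path = os.path.join(args.data_folder, 'word_vectors')
print('loading word vectors from', vectors_path)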

Upvotes: 4

Peter Pan

Reputation: 24148

After reading the GitHub issue Encounter |total Snapshot size 300MB while start logging and the official document Manage and request quotas for Azure resources for the Azure ML service, I think this is an open issue that Azure will need some time to fix.

In the meantime, I'd recommend migrating your current work to another service, Azure Databricks: upload your dataset and code there and run them in an Azure Databricks notebook, which executes on a managed Spark cluster, without worrying about memory or storage limits. You can refer to these samples for Azure ML on Azure Databricks.
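For completeness, a minimal sketch of persisting a large file from a Databricks notebook, assuming a hypothetical word_vectors.bin already present on the driver's local disk:

# Runs in an Azure Databricks notebook cell; dbutils is provided by the runtime.
# Copy the hypothetical local file into DBFS so it survives cluster restarts.
dbutils.fs.cp('file:/tmp/word_vectors.bin', 'dbfs:/data/word_vectors.bin')

# List the destination to verify the copy.
display(dbutils.fs.ls('dbfs:/data/'))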

Upvotes: 0
