Reputation: 1886
I have an AML compute cluster with the min and max nodes set to 2. When I execute a pipeline, I expect the cluster to run the training on both instances in parallel, but the cluster status reports that only one node is busy and the other is idle.
Here's my code to submit the pipeline. As you can see, I'm resolving the cluster by name and passing it to my Step 1, which trains a model with Keras.
aml_compute = AmlCompute(ws, "cluster-name")
step1 = PythonScriptStep(name="train_step",
                         script_name="Train.py",
                         arguments=["--sourceDir", os.path.realpath(source_directory)],
                         compute_target=aml_compute,
                         source_directory=source_directory,
                         runconfig=run_config,
                         allow_reuse=False)
pipeline_run = Experiment(ws, 'MyExperiment').submit(pipeline1, regenerate_outputs=False)
Upvotes: 2
Views: 1585
Reputation: 3961
Really great question. The TL;DR is that there isn't an easy way to do this right now. IMHO there are a few questions within your question -- here's a stab at all of them.
keras

I'm no keras expert, but from their distributed training guide, I'm interested to know which kind of parallelism you are after: model parallelism or data parallelism? For data parallelism, it looks like the tf.distribute API is the way to go. I would strongly recommend getting that working on a single, multi-GPU machine (local or an Azure VM) without Azure ML before starting to use Pipelines.
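To make the data-parallel route concrete, here is a minimal single-machine sketch using tf.distribute.MirroredStrategy (assumes TensorFlow 2.x; the model and data shapes are made up purely for illustration):

```python
import numpy as np
import tensorflow as tf

# MirroredStrategy replicates the model across all local GPUs
# (it falls back to a single replica on a CPU-only machine).
strategy = tf.distribute.MirroredStrategy()
print('replicas in sync:', strategy.num_replicas_in_sync)

# Build and compile the model inside the strategy scope so its
# variables are mirrored across the replicas.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(8,)),
        tf.keras.layers.Dense(16, activation='relu'),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer='adam', loss='mse')

# Dummy data; each global batch is split across the replicas.
x = np.random.rand(64, 8).astype('float32')
y = np.random.rand(64, 1).astype('float32')
model.fit(x, y, epochs=1, batch_size=16, verbose=0)
```

The training loop itself is unchanged; only the model construction moves inside `strategy.scope()`.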
This Azure ML notebook shows how to use PyTorch with Horovod on Azure ML. It seems not too tricky to adapt it to work with keras.
As for how to get distributed training to work inside an Azure ML Pipeline, one spitball workaround would be to have the PythonScriptStep act as a controller that creates a new compute cluster and submits the training script to it. I'm not too confident about this, but I'll do some digging.
PythonScriptSteps
This is possible (at least with pyspark). Below is a PythonScriptStep from a production pipeline of ours that can run on more than one node. It uses a Docker image with Spark pre-installed and a pyspark RunConfiguration. In the screenshots below you can see that one of the nodes is the primary orchestrator and the other is a secondary worker.
import os
from azureml.core import Environment, RunConfiguration
from azureml.pipeline.steps import PythonScriptStep

spark_env = Environment.from_pip_requirements(
    'spark_env',
    os.path.join(os.getcwd(), 'compute', 'spark-requirements.txt'))
spark_env.docker.enabled = True
spark_env.docker.base_image = 'microsoft/mmlspark:0.16'

spark_run_config = RunConfiguration(framework="pyspark")
spark_run_config.environment = spark_env
spark_run_config.node_count = 2

roll_step = PythonScriptStep(
    name='roll.py',
    script_name='roll.py',
    arguments=['--input_dir', joined_data,
               '--output_dir', rolled_data],
    compute_target=compute_target_spark,
    inputs=[joined_data],
    outputs=[rolled_data],
    runconfig=spark_run_config,
    source_directory=os.path.join(os.getcwd(), 'compute', 'roll'),
    allow_reuse=pipeline_reuse
)
Upvotes: 2
Reputation: 143
Each PythonScriptStep runs on a single node, even if you allocate multiple nodes in your cluster. I'm not sure whether training across different instances is possible off-the-shelf in AML, but there's definitely the possibility of using that single node more effectively (look into using all of its cores, etc.).
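For the "use all your cores" part, the standard library is enough; here is a hedged sketch with concurrent.futures, where the preprocess function is just a stand-in for whatever CPU-bound work your step does:

```python
import os
from concurrent.futures import ProcessPoolExecutor


def preprocess(chunk):
    # Placeholder for per-chunk CPU-bound work (feature extraction, etc.).
    return sum(x * x for x in chunk)


def run_parallel(chunks):
    # One worker process per core on the node.
    workers = os.cpu_count() or 1
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(preprocess, chunks))


if __name__ == '__main__':
    chunks = [range(i * 1000, (i + 1) * 1000) for i in range(8)]
    results = run_parallel(chunks)
    print(len(results))  # one result per chunk
```

Processes (rather than threads) matter here because pure-Python work is GIL-bound; for NumPy-heavy code, threads can be enough.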
Upvotes: 2