Leon Cullens

Reputation: 12476

Best Practices for Azure Machine Learning Pipelines

I started working with Azure Machine Learning Service. It has a feature called Pipeline, which I'm currently trying to use. There are, however, a bunch of things that are completely unclear from the documentation and the examples, and I'm struggling to fully grasp the concept.

  1. When I look at 'batch scoring' examples, it is implemented as a Pipeline Step. This raises the question: does this mean that the 'predicting part' is part of the same pipeline as the 'training part', or should there be 2 separate pipelines for this? Making 1 pipeline that combines both steps seems odd to me, because you don't want to run your predicting part every time you change something in the training part (and vice versa).
  2. What parts should be implemented as a Pipeline Step and what parts shouldn't? Should the creation of the Datastore and Dataset be implemented as a step? Should registering a model be implemented as a step?
  3. What isn't shown anywhere is how to deal with model registry. I create the model in the training step and then write it to the output folder as a pickle file. Then what? How do I get the model in the next step? Should I pass it on as a PipelineData object? Should train.py itself be responsible for registering the trained model?

Upvotes: 3

Views: 2415

Answers (2)

Trevor Bye

Reputation: 708

Anders has a great answer, but I'll expand on #1 a bit. In the batch scoring examples you've seen, the assumption is that there is already a trained model, which could be coming from another pipeline, or in the case of the notebook, it's a pre-trained model not built in a pipeline at all.

However, running both training and prediction in the same pipeline is a valid use-case. Use the allow_reuse param and set it to True, which will cache the step output in the pipeline to prevent unnecessary reruns.

Take a model training step for example, and consider the following input to that step:

  • training script
  • input data
  • additional step params

If you set allow_reuse=True, and your training script, input data, and other step params are the same as the last time the pipeline ran, the step will not rerun; it will use the cached output from the last run. But if, say, your input data changed, the step would rerun.
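A minimal sketch of a training step with caching enabled, using the Azure ML SDK v1. It assumes an existing workspace config, a default datastore, a compute target named "cpu-cluster", and a scripts/train.py — all of those names are illustrative placeholders:

```python
from azureml.core import Workspace
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()  # assumes a local config.json for your workspace
datastore = ws.get_default_datastore()

# Output placeholder for the trained model (written as a pickled file by train.py)
model_dir = PipelineData("model_dir", datastore=datastore)

train_step = PythonScriptStep(
    name="train",
    script_name="train.py",
    source_directory="scripts",           # hypothetical folder
    arguments=["--output_dir", model_dir],
    outputs=[model_dir],
    compute_target="cpu-cluster",         # hypothetical compute target
    allow_reuse=True,  # reuse cached output when script, inputs, and params are unchanged
)

pipeline = Pipeline(workspace=ws, steps=[train_step])
```

With allow_reuse=True, resubmitting this pipeline without changing train.py, its inputs, or its arguments reuses the cached model_dir instead of retraining.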

In general, pipelines are pretty modular and you can build them how you see fit. You could maintain separate pipelines for training and scoring, or bundle everything in one pipeline but leverage the automatic caching.

Upvotes: 5

Anders Swanson

Reputation: 3961

Azure ML pipelines best practices are emergent, so I can give you some recommendations, but I wouldn't be surprised if others respond with divergent deeply-held opinions. The Azure ML product group is also improving and expanding the product at a phenomenal pace, so I fully expect things to change (for the better) over time. This article does a good job of explaining ML pipelines.

3 Passing a model to a downstream step

How do I get the model in the next step?

During development, I recommend that you don't register your model and that the scoring step receives your model via a PipelineData as a pickled file.

In production, the scoring step should use a previously registered model.

Our team uses a PythonScriptStep that has a script argument that allows a model to be passed from an upstream step or fetched from the registry. The screenshot below shows our batch score step using a PipelineData named best_run_data which contains the best model (saved as model.pkl) from a HyperDriveStep.

[screenshot: batch scoring step consuming the best_run_data PipelineData]

The definition of our batch_score_step has a boolean argument, '--use_model_registry', that determines whether to use the recently trained model or the model registry. We use a function, get_model_path(), to pivot on the script arg. Here are some code snippets of the above.
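The original snippets aren't reproduced here, but the pivot logic can be sketched in plain Python. The helper name get_model_path and the --use_model_registry flag come from the description above; the directory layout and the registry callback are illustrative assumptions:

```python
import os


def get_model_path(use_model_registry, pipeline_data_dir, fetch_from_registry=None):
    """Pivot on the script arg: return the model path either from the
    upstream PipelineData directory or from the model registry."""
    if use_model_registry:
        # In the real step this would resolve the registered model instead,
        # e.g. via Model.get_model_path(model_name) in the Azure ML SDK.
        return fetch_from_registry()
    # Development path: model.pkl written by the upstream training step
    # into the PipelineData directory (best_run_data in the screenshot).
    return os.path.join(pipeline_data_dir, "model.pkl")
```

In the real batch_score_step, --use_model_registry would be exposed via argparse as a script argument, so the same script serves both development (fresh model from the upstream step) and production (registered model) runs.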

2 Control Plane vs Data Plane

What parts should be implemented as a Pipeline Step and what parts shouldn't?

All transformations you do to your data (munging, featurization, training, scoring) should take place inside of PipelineSteps, whose inputs and outputs should be PipelineData objects.

Azure ML artifacts should be:

  • created in the pipeline control plane using PipelineData, and
  • registered either:
      • ad-hoc, as opposed to with every run, or
      • when you need to pass artifacts between pipelines.

In this way, PipelineData is the glue that connects pipeline steps directly, rather than connecting them indirectly with .register() and .download().

PipelineData objects are ultimately just ephemeral directories that can also be used as placeholders before steps are run to create and register artifacts.
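As a sketch of that glue role, the same PipelineData can be declared as the output of one step and the input of the next, which also wires the step dependency. Step names, script names, the datastore variable, and the compute target here are illustrative assumptions:

```python
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import PythonScriptStep

# Ephemeral directory produced by the training step and consumed downstream
model_dir = PipelineData("model_dir", datastore=datastore)  # assumes an existing datastore

train_step = PythonScriptStep(
    name="train",
    script_name="train.py",
    source_directory="scripts",
    arguments=["--output_dir", model_dir],
    outputs=[model_dir],
    compute_target="cpu-cluster",  # hypothetical
)

score_step = PythonScriptStep(
    name="batch_score",
    script_name="score.py",
    source_directory="scripts",
    arguments=["--model_dir", model_dir],
    inputs=[model_dir],  # declaring the input makes batch_score run after train
    compute_target="cpu-cluster",
)
```

No .register() or .download() is needed to move the model between these two steps; the directory is mounted for score.py at runtime.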

Datasets are abstractions of PipelineData that make things easier to pass to AutoMLStep, HyperDriveStep, and DataDrift.

1 Pipeline encapsulation

does this mean that the 'predicting part' is part of the same pipeline as the 'training part', or should there be 2 separate pipelines for this?

Your pipeline architecture depends on whether:

  1. you need to predict live (else batch prediction is sufficient), and
  2. your data is already transformed and ready for scoring.

If you need live scoring, you should deploy your model. If batch scoring is fine, you could either:

  • have a training pipeline at the end of which you register a model that is then used in a scoring pipeline, or
  • do as we do and have one pipeline that can be configured to do either using script arguments.

Upvotes: 4
