Reputation: 12476
I started working with Azure Machine Learning Service. It has a feature called Pipeline, which I'm currently trying to use. There are, however, a bunch of things that are completely unclear from the documentation and the examples, and I'm struggling to fully grasp the concept.
Upvotes: 3
Views: 2415
Reputation: 708
Anders has a great answer, but I'll expand on #1 a bit. In the batch scoring examples you've seen, the assumption is that there is already a trained model, which could come from another pipeline or, in the case of the notebook, be a pre-trained model not built in a pipeline at all.
However, running both training and prediction in the same pipeline is a valid use-case. Use the allow_reuse param and set it to True, which will cache the step output in the pipeline to prevent unnecessary reruns.
Take a model training step for example, and consider its inputs: the training script, the input data, and the other step params. If you set allow_reuse=True and all of those are the same as the last time the pipeline ran, the step will not rerun; it will use the cached output from the previous run. But say your input data changed: then the step would rerun.
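As a rough sketch (the step name, script, and compute target below are hypothetical, not from the question), a training step with reuse enabled might look like this:

```python
from azureml.core import Workspace
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Ephemeral directory where train.py will write the pickled model.
model_dir = PipelineData("model_dir", datastore=datastore)

train_step = PythonScriptStep(
    name="train",
    script_name="train.py",        # hypothetical training script
    arguments=["--model_dir", model_dir],
    outputs=[model_dir],
    compute_target="cpu-cluster",  # hypothetical compute target
    source_directory="./train",
    # Skip this step on future runs if the script, inputs,
    # and params are unchanged, and reuse the cached output.
    allow_reuse=True,
)
```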
In general, pipelines are pretty modular and you can build them how you see fit. You could maintain separate pipelines for training and scoring, or bundle everything in one pipeline but leverage the automatic caching.
Upvotes: 5
Reputation: 3961
Azure ML pipelines best practices are emergent, so I can give you some recommendations, but I wouldn't be surprised if others respond with divergent, deeply-held opinions. The Azure ML product group is also improving and expanding the product at a phenomenal pace, so I fully expect things to change (for the better) over time. This article does a good job of explaining ML pipelines.
How do I get the model in the next step?
During development, I recommend that you don't register your model, and that the scoring step receives your model via a PipelineData as a pickled file.
In production, the scoring step should use a previously registered model.
Our team uses a PythonScriptStep that has a script argument that allows a model to be passed from an upstream step or fetched from the registry. The screenshot below shows our batch score step using a PipelineData named best_run_data, which contains the best model (saved as model.pkl) from a HyperDriveStep.
The definition of our batch_score_step has a boolean argument, '--use_model_registry', that determines whether to use the recently trained model or the model registry. We use a function, get_model_path(), to pivot on the script arg. Here are some code snippets of the above.
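(The argument parsing, the registered model name, and the workspace lookup in this sketch are my assumptions rather than the team's actual code.)

```python
import argparse
import os

from azureml.core import Run
from azureml.core.model import Model

parser = argparse.ArgumentParser()
# '--use_model_registry' arrives as a string from the step arguments.
parser.add_argument("--use_model_registry", type=str, default="false")
parser.add_argument("--best_run_data", type=str, default=None)
args = parser.parse_args()


def get_model_path():
    """Return a local path to model.pkl, either from the registry
    or from the upstream HyperDriveStep's PipelineData."""
    if args.use_model_registry.lower() == "true":
        # Fetch the registered model ("best_model" is a hypothetical name).
        ws = Run.get_context().experiment.workspace
        return Model.get_model_path("best_model", _workspace=ws)
    # Otherwise use the model.pkl the upstream step wrote into
    # the best_run_data PipelineData directory.
    return os.path.join(args.best_run_data, "model.pkl")
```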
What parts should be implemented as a Pipeline Step and what parts shouldn't?
All transformations you do to your data (munging, featurization, training, scoring) should take place inside PipelineSteps, and their inputs and outputs should be PipelineDatas.
Azure ML artifacts should be:
- created in the pipeline control plane using PipelineData, and
- registered either:
- ad-hoc, as opposed to with every run, or
- when you need to pass artifacts between pipelines.
In this way, PipelineData is the glue that connects pipeline steps directly, rather than connecting them indirectly with .register() and .download(). PipelineDatas are ultimately just ephemeral directories that can also be used as placeholders before steps are run to create and register artifacts.
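A minimal sketch of that wiring (step names, scripts, and compute target are hypothetical):

```python
from azureml.core import Workspace
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()

# The glue: one ephemeral directory, an output of train and an input of score.
model_dir = PipelineData("model_dir", datastore=ws.get_default_datastore())

train_step = PythonScriptStep(
    name="train", script_name="train.py", source_directory="./train",
    arguments=["--model_dir", model_dir], outputs=[model_dir],
    compute_target="cpu-cluster",
)

batch_score_step = PythonScriptStep(
    name="batch_score", script_name="score.py", source_directory="./score",
    arguments=["--model_dir", model_dir], inputs=[model_dir],
    compute_target="cpu-cluster",
)

# The shared PipelineData implies the ordering: train runs before batch_score.
pipeline = Pipeline(workspace=ws, steps=[batch_score_step])
```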
Datasets are abstractions of PipelineDatas in that they make things easier to pass to AutoMLStep, HyperDriveStep, and DataDrift.
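For example, a sketch that assumes the v1 SDK's PipelineData.as_dataset() and PipelineOutputFileDataset.parse_delimited_files() (the names here are hypothetical):

```python
from azureml.core import Workspace
from azureml.pipeline.core import PipelineData

datastore = Workspace.from_config().get_default_datastore()

# Any intermediate PipelineData can be promoted into a Dataset, making it
# easier to feed to steps that expect Dataset inputs.
featurized_data = PipelineData("featurized_data", datastore=datastore)
featurized_dataset = featurized_data.as_dataset()

# The resulting file dataset can be parsed into tabular form,
# e.g. for an AutoMLStep.
featurized_tabular = featurized_dataset.parse_delimited_files()
```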
does this mean that the 'predicting part' is part of the same pipeline as the 'training part', or should there be 2 separate pipelines for this?
Your pipeline architecture depends on your scoring requirements. If you need live scoring, you should deploy your model as a web service. If batch scoring is fine, you could either have a single pipeline that both trains and scores, or two pipelines, where the scoring pipeline consumes the model registered by the training pipeline.
Upvotes: 4