Reputation: 289
We are in the process of establishing a Machine Learning pipeline on Google Cloud, leveraging GC ML-Engine for distributed TensorFlow training and model serving, and DataFlow for distributed pre-processing jobs.
We would like to run our Apache Beam apps as DataFlow jobs on Google Cloud. Looking at the ML-Engine samples, it appears possible to choose which PipelineRunner runs tensorflow_transform.beam.impl AnalyzeAndTransformDataset, as follows:
import apache_beam as beam
from tensorflow_transform.beam import impl as tft
# The runner is selected by name when the pipeline is created.
pipeline_name = "DirectRunner"
p = beam.Pipeline(pipeline_name)
p | "xxx" >> xxx | "yyy" >> yyy | tft.AnalyzeAndTransformDataset(...)
TemplatingDataflowPipelineRunner provides the ability to separate our preprocessing development from parameterized operations - see here: https://cloud.google.com/dataflow/docs/templates/overview - basically:
The question is: how can we use tf.Transform with TemplatingDataflowPipelineRunner?
Upvotes: 1
Views: 920
Reputation: 251
Python templates are available as of April 2017 (see documentation). To use them, declare the template parameters with add_value_provider_argument in a PipelineOptions subclass:
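from apache_beam.options.pipeline_options import PipelineOptions
class UserOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Resolved at run time, so it can be set when the template is launched.
        parser.add_value_provider_argument('--value_provider_arg', default='some_value')
        # Ordinary argument, fixed when the template is built.
        parser.add_argument('--non_value_provider_arg', default='some_other_value')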
Note that Python has no TemplatingDataflowPipelineRunner, and neither does Java 2.X (unlike Java 1.X, which had one).
Upvotes: 6
Reputation: 1725
Unfortunately, Python pipelines cannot be used as templates; templates are only available for Java today. Because you need the Python library, this is not feasible.
tensorflow_transform would also need to support ValueProvider, so that options could be passed through it as value provider types.
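The reason a library has to support value providers explicitly can be sketched without any Beam or tf.Transform code. The classes and functions below are hypothetical stand-ins, not real APIs: they only illustrate that a parameter read while the pipeline is being built fails for a template, whereas one read while elements are processed works.

```python
class DeferredValue:
    """Hypothetical stand-in for a template parameter resolved at run time."""

    def __init__(self, name):
        self.name = name
        self._runtime_args = None  # filled in only when the job starts

    def bind(self, runtime_args):
        self._runtime_args = runtime_args

    def get(self):
        if self._runtime_args is None:
            raise RuntimeError(
                f"{self.name} is not available at construction time")
        return self._runtime_args[self.name]


def eager_step(param):
    # Reads the value while the pipeline is being built: fails for templates.
    value = param.get()
    return lambda x: x + value


def lazy_step(param):
    # Keeps the provider and reads it only when an element is processed.
    return lambda x: x + param.get()


offset = DeferredValue("offset")
step = lazy_step(offset)  # construction succeeds without a value

try:
    eager_step(offset)
    eager_failed = False
except RuntimeError:
    eager_failed = True

offset.bind({"offset": 10})  # simulate launching the template with parameters
result = step(1)
```

A library that behaves like `eager_step` internally cannot accept template parameters, which is the kind of change the answer says tensorflow_transform would need.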
Upvotes: 1