Josh Reuben

Reputation: 289

How to use Google DataFlow Runner and Templates in tf.Transform?

We are in the process of establishing a Machine Learning pipeline on Google Cloud, leveraging GC ML-Engine for distributed TensorFlow training and model serving, and DataFlow for distributed pre-processing jobs.

We would like to run our Apache Beam apps as DataFlow jobs on Google Cloud. Looking at the ML-Engine samples, it appears possible to make tensorflow_transform.beam.impl's AnalyzeAndTransformDataset use a specific PipelineRunner, as follows:

import apache_beam as beam
from tensorflow_transform.beam import impl as tft

pipeline_name = "DirectRunner"  # name of the PipelineRunner to use
p = beam.Pipeline(pipeline_name)
p | "xxx" >> xxx | "yyy" >> yyy | tft.AnalyzeAndTransformDataset(...)

TemplatingDataflowPipelineRunner would let us separate our preprocessing development from parameterized operations; see https://cloud.google.com/dataflow/docs/templates/overview

The question is: Can you show me how we can use tf.Transform to leverage TemplatingDataflowPipelineRunner?

Upvotes: 1

Views: 920

Answers (2)

Python templates are available as of April 2017 (see documentation). They work as follows:

  • Define UserOptions subclassed from PipelineOptions.
  • Use the add_value_provider_argument API to add specific arguments to be parameterized.
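  • Regular non-parameterizable options will continue to be defined using argparse's add_argument.

from apache_beam.options.pipeline_options import PipelineOptions

class UserOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Templated option: resolved when the template is executed
        parser.add_value_provider_argument('--value_provider_arg', default='some_value')
        # Regular option: fixed when the pipeline is constructed
        parser.add_argument('--non_value_provider_arg', default='some_other_value')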

Note that neither Python nor Java 2.X has a TemplatingDataflowPipelineRunner (unlike Java 1.X).

Upvotes: 6

Alex Amato

Reputation: 1725

Unfortunately, Python pipelines cannot be used as templates. Templates are only available for Java today. Since you need to use the Python library, this will not be feasible.

tensorflow_transform would also need to support ValueProvider so that you can pass in options as a value provider type through it.
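
To illustrate why library support matters, here is a rough pure-Python sketch of the difference between reading an option while the pipeline is built and deferring the read until execution. StaticValue, RuntimeValue, eager_step and deferred_step are hypothetical stand-ins that only mimic the general shape of a value provider; they are not Beam's or tensorflow_transform's actual classes.

```python
class StaticValue:
    """Hypothetical stand-in: a value already known at construction time."""

    def __init__(self, value):
        self._value = value

    def get(self):
        return self._value


class RuntimeValue:
    """Hypothetical stand-in: a value supplied only when the job is launched."""

    runtime_options = None  # set once, when the job actually starts

    def __init__(self, name, default=None):
        self.name = name
        self.default = default

    def get(self):
        if RuntimeValue.runtime_options is None:
            raise RuntimeError('%s is not available at construction time' % self.name)
        return RuntimeValue.runtime_options.get(self.name, self.default)


def eager_step(provider):
    # Reads the value while the step is being built, so a RuntimeValue fails here.
    value = provider.get()
    return lambda row: row * value


def deferred_step(provider):
    # Keeps the provider and reads it only when a row is processed.
    return lambda row: row * provider.get()
```

A library that behaves like eager_step cannot be templated, because it needs the concrete value before the job is launched; one that behaves like deferred_step can accept either kind of provider.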

Upvotes: 1
