Will

Reputation: 1631

Dataflow from Colab issue

I'm trying to run a Dataflow job from Colab and getting the following worker error:

sdk_worker_main.py: error: argument --flexrs_goal: invalid choice: '/root/.local/share/jupyter/runtime/kernel-1dbd101c-a79e-432e-89b3-5ba68df104d7.json' (choose from 'COST_OPTIMIZED', 'SPEED_OPTIMIZED')

I haven't provided the flexrs_goal argument, and explicitly setting it doesn't fix the issue. Here are my pipeline options:

beam_options = PipelineOptions(
    runner='DataflowRunner',
    project=...,
    job_name=...,
    temp_location=...,
    subnetwork='regions/us-west1/subnetworks/default',
    region='us-west1'
)

My pipeline is very simple; it's just:

with beam.Pipeline(options=beam_options) as pipeline:
  (pipeline 
    | beam.io.ReadFromBigQuery(
        query=f'SELECT column FROM {BQ_TABLE} LIMIT 100')
    | beam.Map(print))

It looks like the command-line args for the SDK worker are getting polluted by Jupyter somehow. I've rolled back through the past two apache-beam releases and it hasn't helped. I could move over to Vertex Workbench, but I've invested a lot in this Colab notebook (plus I like the easy sharing) and I'd rather not migrate.

Upvotes: 3

Views: 501

Answers (1)

Will

Reputation: 1631

Figured it out. The PipelineOptions constructor pulls in sys.argv if no value is given for its first argument (called flags). In my case it was picking up the command-line args that my Jupyter kernel was started with and passing them along as Beam options to the workers.
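You can see what gets picked up by printing sys.argv from the notebook (a quick sanity check; the exact paths will differ on your machine):

import sys

# In a Colab/Jupyter notebook this is the kernel launcher's argv, something like
# ['.../ipykernel_launcher.py', '-f', '/root/.local/share/jupyter/runtime/kernel-....json']
# The kernel connection-file path is presumably what ends up mis-parsed as the
# value of --flexrs_goal in the error above.
print(sys.argv)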

I fixed my issue by passing an explicit empty list of flags:

beam_options = PipelineOptions(
    flags=[],
    ...
)
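For reference, a full version of my options with the fix applied might look like this (project, job name, and bucket values are placeholders; substitute your own):

from apache_beam.options.pipeline_options import PipelineOptions

beam_options = PipelineOptions(
    flags=[],  # don't parse sys.argv, so the kernel args stay out of the job
    runner='DataflowRunner',
    project='my-project',                # placeholder
    job_name='my-dataflow-job',          # placeholder
    temp_location='gs://my-bucket/tmp',  # placeholder
    subnetwork='regions/us-west1/subnetworks/default',
    region='us-west1',
)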

Upvotes: 4
