Will

Reputation: 1631

Dataflow from Colab issue

I'm trying to run a Dataflow job from Colab and getting the following worker error:

sdk_worker_main.py: error: argument --flexrs_goal: invalid choice: '/root/.local/share/jupyter/runtime/kernel-1dbd101c-a79e-432e-89b3-5ba68df104d7.json' (choose from 'COST_OPTIMIZED', 'SPEED_OPTIMIZED')

I haven't provided the flexrs_goal argument, and explicitly setting it doesn't fix the issue. Here are my pipeline options:

beam_options = PipelineOptions(
    runner='DataflowRunner',
    project=...,
    job_name=...,
    temp_location=...,
    subnetwork='regions/us-west1/subnetworks/default',
    region='us-west1'
)

My pipeline is very simple; it's just:

with beam.Pipeline(options=beam_options) as pipeline:
  (pipeline 
    | beam.io.ReadFromBigQuery(
        query=f'SELECT column FROM {BQ_TABLE} LIMIT 100')
    | beam.Map(print))

It looks like the command-line args for the SDK worker are getting polluted by Jupyter somehow. I've rolled back through the past two apache-beam releases and it hasn't helped. I could move over to Vertex Workbench, but I've invested a lot in this Colab notebook (plus I like the easy sharing) and I'd rather not migrate.

Upvotes: 3

Views: 501

Answers (1)

Will

Reputation: 1631

Figured it out. The PipelineOptions constructor pulls in sys.argv if no value is given for its first argument (called flags). In my case it was picking up the command-line args that my Jupyter kernel was started with and passing them along as Beam options to the workers.
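You can see what gets picked up by printing sys.argv from the notebook (a quick sanity check; the exact paths will differ on your machine):

import sys

# In a Colab/Jupyter notebook this is the kernel launcher's argv, something like
# ['.../ipykernel_launcher.py', '-f', '/root/.local/share/jupyter/runtime/kernel-....json']
# The kernel connection-file path is presumably what ends up mis-parsed as the
# value of --flexrs_goal in the error above.
print(sys.argv)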

I fixed my issue by passing an explicit empty list of flags:

beam_options = PipelineOptions(
    flags=[],
    ...
)
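For reference, a full version of my options with the fix applied might look like this (project, job name, and bucket values are placeholders; substitute your own):

from apache_beam.options.pipeline_options import PipelineOptions

beam_options = PipelineOptions(
    flags=[],  # don't parse sys.argv, so the kernel args stay out of the job
    runner='DataflowRunner',
    project='my-project',                # placeholder
    job_name='my-dataflow-job',          # placeholder
    temp_location='gs://my-bucket/tmp',  # placeholder
    subnetwork='regions/us-west1/subnetworks/default',
    region='us-west1',
)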

Upvotes: 4
