Pyy

Reputation: 367

Dataflow SDK version

I'm running into a problem testing out Dataflow by running code like this from a Datalab cell.

import apache_beam as beam

# Pipeline options:
options                         = beam.options.pipeline_options.PipelineOptions()
gcloud_options                  = options.view_as(beam.options.pipeline_options.GoogleCloudOptions)
gcloud_options.job_name         = 'test002'
gcloud_options.project          = 'proj'
gcloud_options.staging_location = 'gs://staging'
gcloud_options.temp_location    = 'gs://tmp'
# gcloud_options.region           = 'europe-west2'

# Worker options:
worker_options                  = options.view_as(beam.options.pipeline_options.WorkerOptions)
worker_options.disk_size_gb     = 30
worker_options.max_num_workers  = 10

# Standard options:
options.view_as(beam.options.pipeline_options.StandardOptions).runner = 'DataflowRunner'
# options.view_as(beam.options.pipeline_options.StandardOptions).runner = 'DirectRunner'

# Pipeline:

PL = beam.Pipeline(options=options)

query = 'SELECT * FROM [bigquery-public-data:samples.shakespeare] LIMIT 10'    
(
    PL | 'read'  >> beam.io.Read(beam.io.BigQuerySource(project='project', use_standard_sql=False, query=query))
       | 'write' >> beam.io.WriteToText('gs://test/test2.txt', num_shards=1)
)

PL.run()

print "Complete"

There have been various successful attempts and a few failures. That is fine and understood, but what I don't understand is what I have done to change the SDK version from 2.9.0 to 2.0.0, as shown below. Could anyone point out what I've done and how to move back up to SDK version 2.9.0, please?

[Screenshot: Dataflow console showing the jobs' SDK versions]

Upvotes: 3

Views: 1877

Answers (2)

Guillem Xercavins

Reputation: 7058

You can check which SDK version you'll use by running:

!pip freeze | grep beam

In your case this should return:

apache-beam==2.0.0

You can then force the desired version (i.e. 2.9.0) by adding a cell at the top of the notebook:

!pip install apache-beam[gcp]==2.9.0

If you already submitted a job, you might need to restart the kernel (reset the session) for the change to take effect. There is a one-day difference between the jobs with different SDKs, so my guess is that you, or someone else, changed the dependencies in the meantime (assuming those jobs were run from the same Datalab instance and notebook), possibly without being aware of it (e.g. via a kernel restart).
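As a quick sanity check, you can confirm which SDK version the kernel will actually submit with; a minimal sketch, assuming it is run in a fresh cell after the reinstall and kernel restart:

# Should print 2.9.0 after the upgrade, matching the SDK version
# reported on the Dataflow jobs page.
import apache_beam as beam
print(beam.__version__)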

Upvotes: 1

Eric Schmidt

Reputation: 1327

Can you look at the Cloud logs for a failed job and post what you see?

Upvotes: 0
