I am trying to build an Apache Beam pipeline in Python 3.7 with Beam SDK version 2.20.0. The pipeline gets deployed on Dataflow successfully, but it does not seem to be doing anything. In the worker logs, I can see the following error message reported repeatedly:
Error syncing pod xxxxxxxxxxx (), skipping: Failed to start container worker log
I have tried everything I could, but this error is quite stubborn. My pipeline looks like this:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import GoogleCloudOptions
from apache_beam.options.pipeline_options import StandardOptions
from apache_beam.options.pipeline_options import WorkerOptions
from apache_beam.options.pipeline_options import SetupOptions
from apache_beam.options.pipeline_options import DebugOptions
options = PipelineOptions()
options.view_as(GoogleCloudOptions).project = PROJECT
options.view_as(GoogleCloudOptions).job_name = job_name
options.view_as(GoogleCloudOptions).region = region
options.view_as(GoogleCloudOptions).staging_location = staging_location
options.view_as(GoogleCloudOptions).temp_location = temp_location
options.view_as(WorkerOptions).zone = zone
options.view_as(WorkerOptions).network = network
options.view_as(WorkerOptions).subnetwork = sub_network
options.view_as(WorkerOptions).use_public_ips = False
options.view_as(StandardOptions).runner = 'DataflowRunner'
options.view_as(StandardOptions).streaming = True
options.view_as(SetupOptions).sdk_location = ''
options.view_as(SetupOptions).save_main_session = True
options.view_as(DebugOptions).experiments = []
print('running pipeline...')
with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | 'ReadFromPubSub' >> beam.io.ReadFromPubSub(topic=topic_name).with_output_types(bytes)
        | 'ProcessMessage' >> beam.ParDo(Split())
        | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
            table=bq_table_name,
            schema=bq_schema,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )

result = pipeline.run()
I have tried supplying a Beam SDK 2.20.0 tar.gz from the compute instance using the sdk_location parameter, but that doesn't work either. I can't use sdk_location = default, as that triggers a download from pypi.org. I am working in an offline environment and connectivity to the internet is not an option. Any help would be highly appreciated.
The pipeline itself is deployed in a container, and all libraries that go with Apache Beam 2.20.0 are specified in a requirements.txt file; the Docker image installs all the libraries.
Upvotes: 1
Views: 1688
Reputation: 41
TL;DR: Copy the Apache Beam SDK archive into an accessible path and provide that path via the sdk_location option.
I was also struggling with this setup. Finally I found a solution; even though your question was asked quite a while ago, this answer might still help someone else.
There are probably multiple ways to do that, but the following two are quite simple.
As a precondition you'll need to create the apache-beam-sdk source archive as follows:
1. Clone the Apache Beam GitHub repository
2. Switch to the required tag, e.g. v2.28.0
3. cd to beam/sdks/python
4. Create the tar.gz source archive of your required Beam SDK version with: python setup.py sdist
Now you should have the source archive apache-beam-2.28.0.tar.gz in the path beam/sdks/python/dist/.
Option 1 - Use Flex Templates and copy the Apache Beam SDK in the Dockerfile
Documentation: Google Dataflow Documentation
Copy the apache-beam-2.28.0.tar.gz into an accessible path in your Docker image, e.g. COPY utils/apache-beam-2.28.0.tar.gz /tmp, because this is going to be the path you can set in your SetupOptions. The Dockerfile could look like this:
FROM gcr.io/dataflow-templates-base/python3-template-launcher-base
ARG WORKDIR=/dataflow/template
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}
# Due to a change in the Apache Beam base image in version 2.24, you must install
# libffi-dev manually as a dependency. For more information:
# https://github.com/GoogleCloudPlatform/python-docs-samples/issues/4891
# update used packages
RUN apt-get update && apt-get install -y \
libffi-dev \
&& rm -rf /var/lib/apt/lists/*
COPY setup.py .
COPY main.py .
COPY path_to_beam_archive/apache-beam-2.28.0.tar.gz /tmp
ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/main.py"
RUN python -m pip install --user --upgrade pip setuptools wheel
Then, in your pipeline code (e.g. main.py), set sdk_location to the path you copied the archive to in the Dockerfile:
options.view_as(SetupOptions).sdk_location = '/tmp/apache-beam-2.28.0.tar.gz'
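For context, a minimal sketch of how this could look next to the other options in the template's main.py (the option values here are placeholders, not the template's exact code):
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
from apache_beam.options.pipeline_options import StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).runner = 'DataflowRunner'
options.view_as(StandardOptions).streaming = True
# Install the Beam SDK on the workers from the archive copied into the image,
# instead of downloading it from pypi.org.
options.view_as(SetupOptions).sdk_location = '/tmp/apache-beam-2.28.0.tar.gz'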
Build the Docker image, e.g. with Cloud Build:
gcloud builds submit --tag $TEMPLATE_IMAGE .
Then build the Flex Template:
gcloud dataflow flex-template build "gs://define-path-to-your-templates/your-flex-template-name.json" \
--image=gcr.io/your-project-id/image-name:tag \
--sdk-language=PYTHON \
--metadata-file=metadata.json
Finally, run the Flex Template job:
gcloud dataflow flex-template run "your-dataflow-job-name" \
--template-file-gcs-location="gs://define-path-to-your-templates/your-flex-template-name.json" \
--parameters staging_location="gs://your-bucket-path/staging/" \
--parameters temp_location="gs://your-bucket-path/temp/" \
--service-account-email="your-restricted-sa-dataflow@your-project-id.iam.gserviceaccount.com" \
--region="yourRegion" \
--max-workers=6 \
--subnetwork="https://www.googleapis.com/compute/v1/projects/your-project-id/regions/your-region/subnetworks/your-subnetwork" \
--disable-public-ips
Option 2 - Copy the SDK archive from GCS at runtime
According to the Beam documentation you should even be able to provide a gs:// path directly for the sdk_location option, but that didn't work for me. The following should work instead:
Upload the source archive to a bucket, e.g. gs://yourbucketname/beam_sdks/apache-beam-2.28.0.tar.gz, and download it at runtime to a local path such as /tmp/apache-beam-2.28.0.tar.gz before the pipeline options are built:
# see: https://cloud.google.com/storage/docs/samples/storage-download-file
from google.cloud import storage
def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from the bucket."""
    # bucket_name = "your-bucket-name"  (without the "gs://" prefix)
    # source_blob_name = "storage-object-name"  (object path inside the bucket)
    # destination_file_name = "local/path/to/file"
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    # Construct a client-side representation of a blob.
    # Note `Bucket.blob` differs from `Bucket.get_blob` as it doesn't retrieve
    # any content from Google Cloud Storage. As we don't need additional data,
    # using `Bucket.blob` is preferred here.
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)
Then set sdk_location to the local path of the downloaded archive:
options.view_as(SetupOptions).sdk_location = '/tmp/apache-beam-2.28.0.tar.gz'
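Putting Option 2 together, a rough sketch of how the download and the sdk_location option could be combined in the driver program (the bucket name, object path and placeholder transform are just examples):
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions

# Fetch the SDK archive from GCS before the pipeline options are built.
download_blob("your-bucket-name",
              "beam_sdks/apache-beam-2.28.0.tar.gz",
              "/tmp/apache-beam-2.28.0.tar.gz")

options = PipelineOptions()  # plus your Google Cloud / worker / streaming options
# Workers install the Beam SDK from this local archive instead of pypi.org.
options.view_as(SetupOptions).sdk_location = '/tmp/apache-beam-2.28.0.tar.gz'

with beam.Pipeline(options=options) as pipeline:
    # Replace with your real transforms, e.g. ReadFromPubSub | ParDo | WriteToBigQuery.
    pipeline | 'Placeholder' >> beam.Create(['hello'])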
Upvotes: 1