Jesse

Reputation: 8393

BigQuery Apache Beam pipeline "hanging" when using the DirectRunner

I was curious to hear whether anyone else has encountered a similar problem with the Python Apache Beam DirectRunner, as described below. (I'm not able to ship to Dataflow just yet.)

The query being executed returns just under 18 million rows. If I add a LIMIT to the query (e.g. 10000), the pipeline works as expected. Not included in the code snippet is the WriteToBleve sink, a custom sink that supports writing to a bleve index (a rough sketch of its shape follows the snippet below).

The Python SDK being used is 2.2.0, but I'm getting ready to spin up some Java...

The last log message I see when running the pipeline is:

WARNING:root:Dataset my-project:temp_dataset_7708fbe7e7694cd49b8b0de07af2470b does not exist so we will create it as temporary with location=None

The dataset is created correctly and populated, and when I debug into the pipeline I can see the results being iterated, but the pipeline never seems to reach the write stage.

    import apache_beam as beam

    options = {
        "project": "my-project",
        "staging_location": "gs://my-project/staging",
        "temp_location": "gs://my-project/temp",
        "runner": "DirectRunner",
    }
    pipeline_options = beam.pipeline.PipelineOptions(flags=[], **options)
    p = beam.Pipeline(options=pipeline_options)
    # Read the query results and stream them into the custom sink.
    p | 'Read From Bigquery' >> beam.io.Read(beam.io.BigQuerySource(
        query=self.build_query(),
        use_standard_sql=True,
        validate=True,
        flatten_results=False,
    )) | 'Write to Bleve' >> WriteToBleve()

    result = p.run()
    result.wait_until_finish()
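
For reference, here is a rough, simplified sketch of the shape of the sink. The real implementation differs, and the HTTP endpoint below is a made-up stand-in for however the rows actually reach the bleve index:

    # Hypothetical sketch only -- not the real WriteToBleve.
    import requests  # assumed transport; the actual sink may differ

    class _IndexBatchFn(beam.DoFn):
        """Batches rows and ships each batch to the indexing service."""
        BATCH_SIZE = 500

        def start_bundle(self):
            self._batch = []

        def process(self, row):
            self._batch.append(row)
            if len(self._batch) >= self.BATCH_SIZE:
                self._flush()

        def finish_bundle(self):
            if self._batch:
                self._flush()

        def _flush(self):
            # The URL is a placeholder, not a real bleve API.
            requests.post("http://localhost:8094/index/_bulk", json=self._batch)
            self._batch = []

    class WriteToBleve(beam.PTransform):
        def expand(self, pcoll):
            return pcoll | beam.ParDo(_IndexBatchFn())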

Upvotes: 3

Views: 1549

Answers (1)

jkff

Reputation: 17913

The direct runner is meant for local debugging and testing your pipeline on small amounts of data. It is not optimized for performance and is not intended for large datasets; this is the case for both Python and Java.

That said, some very significant improvements to the Python direct runner are currently in progress.

I recommend you try running on Dataflow and see if performance is still unsatisfactory.
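
For example, pointing the snippet from the question at Dataflow is mostly a matter of changing the runner (the project and bucket names below are the placeholders from the question):

    options = {
        "project": "my-project",
        "staging_location": "gs://my-project/staging",
        "temp_location": "gs://my-project/temp",
        # The only change: hand execution off to the Dataflow service,
        # which uses the staging/temp GCS paths above.
        "runner": "DataflowRunner",
    }
    pipeline_options = beam.pipeline.PipelineOptions(flags=[], **options)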

Also, if you can write in Java, I recommend doing so: it often performs orders of magnitude better than Python, especially when reading from BigQuery. Reading from BigQuery goes through a BigQuery export to Avro files, and the performance of the standard Python library for reading Avro files is notoriously poor; unfortunately, there is currently no well-performing, maintained replacement.

Upvotes: 3
