Reputation: 1094
I am trying to read records from a BigQuery table that has 2,410,957,408 records, and it is taking forever just to read them using BigQueryIO.readTableRows() in Apache Beam.
I am using the default machine type "n1-standard-1" and Autoscaling.
What can be done to improve the performance significantly without a large impact on cost? Will a high-mem or high-cpu machine type help?
Upvotes: 3
Views: 1658
Reputation: 17913
I looked at the job you quoted, and it seems that the majority of the time is spent on Beam ingesting the data exported by BigQuery - specifically, it seems, on converting the BigQuery export results to TableRow. TableRow is a very bulky and inefficient object; for better performance I recommend using BigQueryIO.read(SerializableFunction) to read into your custom type directly.
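A minimal sketch of that approach (the table name, column names, and the MyRecord type are placeholders, not from your job):

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.SerializableCoder;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.SchemaAndRecord;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.PCollection;

import java.io.Serializable;

public class CustomTypeRead {

  // Hypothetical lightweight record; your real fields will differ.
  static class MyRecord implements Serializable {
    long id;
    String name;
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Parse each exported Avro record straight into MyRecord,
    // skipping the expensive GenericRecord -> TableRow conversion.
    SerializableFunction<SchemaAndRecord, MyRecord> parseFn =
        (SchemaAndRecord sar) -> {
          GenericRecord r = sar.getRecord();
          MyRecord out = new MyRecord();
          out.id = (Long) r.get("id");          // "id" is a placeholder column
          out.name = r.get("name").toString();  // Avro strings arrive as Utf8
          return out;
        };

    PCollection<MyRecord> rows =
        p.apply(
            BigQueryIO.read(parseFn)
                .from("my-project:my_dataset.my_table") // placeholder table
                .withCoder(SerializableCoder.of(MyRecord.class)));

    p.run().waitUntilFinish();
  }
}
```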
Upvotes: 1
Reputation: 1901
BigQueryIO.readTableRows() first exports the table data to a GCS bucket, and the Beam workers then consume the export from there. The export stage is a BigQuery API operation, which is not very performant and is not part of the Beam implementation.
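If the export stage itself is the bottleneck, one option worth trying (my assumption: your Beam version includes BigQuery Storage Read API support) is to bypass the export entirely with Method.DIRECT_READ:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead;
import org.apache.beam.sdk.values.PCollection;

public class DirectRead {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // DIRECT_READ streams rows over the BigQuery Storage Read API,
    // so no export-to-GCS stage runs at all.
    PCollection<TableRow> rows =
        p.apply(
            BigQueryIO.readTableRows()
                .from("my-project:my_dataset.my_table") // placeholder table
                .withMethod(TypedRead.Method.DIRECT_READ));

    p.run().waitUntilFinish();
  }
}
```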
Upvotes: 3