Reputation: 1094
I am trying to read records from a BigQuery table that has 2,410,957,408 records, and it is taking forever just to read them using BigQueryIO.readTableRows() in Apache Beam.
I am using the default machine type "n1-standard-1" and Autoscaling.
What can be done to improve the performance significantly without a large impact on cost? Will a high-mem or high-cpu machine type help?
Upvotes: 3
Views: 1658
Reputation: 17913
I looked at the job you quoted, and it seems that the majority of the time is spent on Beam ingesting the data exported by BigQuery - specifically, it seems, on converting the BigQuery export results to TableRow. TableRow is a very bulky and inefficient object; for better performance I recommend using BigQueryIO.read(SerializableFunction) to read into your custom type directly.
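A minimal sketch of that approach (the table name, column names, and the MyRecord type are placeholders, not from your job):

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.SerializableCoder;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.SchemaAndRecord;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.PCollection;

import java.io.Serializable;

public class CustomTypeRead {

  // Hypothetical lightweight record; your real fields will differ.
  static class MyRecord implements Serializable {
    long id;
    String name;
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Parse each exported Avro record straight into MyRecord,
    // skipping the expensive GenericRecord -> TableRow conversion.
    SerializableFunction<SchemaAndRecord, MyRecord> parseFn =
        (SchemaAndRecord sar) -> {
          GenericRecord r = sar.getRecord();
          MyRecord out = new MyRecord();
          out.id = (Long) r.get("id");          // "id" is a placeholder column
          out.name = r.get("name").toString();  // Avro strings arrive as Utf8
          return out;
        };

    PCollection<MyRecord> rows =
        p.apply(
            BigQueryIO.read(parseFn)
                .from("my-project:my_dataset.my_table") // placeholder table
                .withCoder(SerializableCoder.of(MyRecord.class)));

    p.run().waitUntilFinish();
  }
}
```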
Upvotes: 1
Reputation: 1901
BigQueryIO.readTableRows() first exports the table data to a GCS bucket, and the Beam workers then consume the export from there. The export stage is a BigQuery API operation, which is not very performant and is not part of the Beam implementation.
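If the export stage itself is the bottleneck, one option worth trying (my assumption: your Beam version includes BigQuery Storage Read API support) is to bypass the export entirely with Method.DIRECT_READ:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead;
import org.apache.beam.sdk.values.PCollection;

public class DirectRead {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // DIRECT_READ streams rows over the BigQuery Storage Read API,
    // so no export-to-GCS stage runs at all.
    PCollection<TableRow> rows =
        p.apply(
            BigQueryIO.readTableRows()
                .from("my-project:my_dataset.my_table") // placeholder table
                .withMethod(TypedRead.Method.DIRECT_READ));

    p.run().waitUntilFinish();
  }
}
```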
Upvotes: 3