Simon
Simon

Reputation: 1823

BigQuery External GCS Table - Optimising Hive Partition Strategy

I've got an external table in BigQuery that pulls its data from Avro files on Google Cloud Storage. I'm currently hive partitioning the data on date as every query will use the date, with an emphasis on newer data. I'm considering also partitioning further on organisation.

I'm not finding much information on the best practices in terms of partitioning to maintain performance and keep costs low. Should I be aiming to keep the number of file reads low (ie have a small number of larger files) or should I be looking to keep the number of bytes being read by BigQuery low (more, smaller files with a fine-grained partition strategy)? Or perhaps it's more nuanced and there's a balance to be kept?

I know this is a tough question without understanding the dataset and queries but I just want to find somewhere to start from rather than just guessing and having to change it later.

Upvotes: 0

Views: 2595

Answers (1)

Nick_Kh
Nick_Kh

Reputation: 5243

There is no general prescription approaching best performance querying a data stored externally (federated data) behind Bigquеry as it mostly depends on the use case and customer purpose, citing the GCP documentation:

  • Loading and cleaning your data in one pass by querying the data from an external data source (a location external to BigQuery) and writing the cleaned result into BigQuery storage.
  • Having a small amount of frequently changing data that you join with other tables. As an external data source, the frequently changing data does not need to be reloaded every time it is updated.

As I mentioned in the comment, due to external data source limitations, if the query performance is the leading factor, when it is recommended to switch to classic way loading data to Bigquery sink:

Query performance for external data sources may not be as high as querying data in a native BigQuery table. If query speed is a priority, load the data into BigQuery instead of setting up an external data source.

Having said this, there is no specific enhancement in the I/O operations with GCS in terms of usage it with Bigquery external data sources:

In general, query performance for external data sources should be equivalent to reading the data directly from the external storage.

Upvotes: 2

Related Questions