Alex

Reputation: 45

BigQuery Pricing Comparison: Loading Data into BigQuery vs Using CREATE EXTERNAL TABLE

My team is developing a data platform on Google Cloud Platform. We uploaded our company's data to Google Cloud Storage and want to build a data mart on BigQuery. However, to save on GCP usage costs, we are deciding between loading all the data from GCS into BigQuery and creating external tables on BigQuery.

Which approach is more cost efficient?

Upvotes: 3

Views: 5144

Answers (3)

evanstipico

Reputation: 1

As I cannot comment yet, this has to be a separate answer. It really depends on what you mean by "make a data mart." Staging is much easier out of a data lake than when uploading manually to a hard table that you then delete: you keep the ability to reach back into a table, usually a Hive table, and retrieve old data. Data lakes excel at staging, and Iceberg can even be used to get point-in-time data. However, GCP is not great for a complete medallion solution, since you pay for scans as mentioned. Warehouses (if required) and data marts should probably be built as regular tables rather than external tables, because that is where joins are more likely. Also, you can partition in Hive, as covered in the other answers.

Upvotes: 0

WolfgangJules

Reputation: 61

The answer from @guillaume blaquiere is good, but it omits something important: it is possible to run partitioned queries. You can create hive-partitioned external tables linked to a bucket in Cloud Storage, e.g.:

  • gs://myBucket/myTable/dt=2019-10-31/lang=en/foo

  • gs://myBucket/myTable/dt=2018-10-31/lang=fr/bar

Then you can filter on "dt" or "lang" in SQL queries from BigQuery, as in the sketch after the link below.

https://cloud.google.com/bigquery/docs/hive-partitioned-queries-gcs
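
For illustration, here is a minimal DDL sketch of such a hive-partitioned external table, assuming Parquet files under the layout above (the dataset name and file format are hypothetical):

    -- Hypothetical dataset name and file format; adjust to your project.
    CREATE EXTERNAL TABLE mydataset.myTable
    WITH PARTITION COLUMNS (
      dt   DATE,    -- derived from dt=YYYY-MM-DD in the object paths
      lang STRING   -- derived from lang=xx in the object paths
    )
    OPTIONS (
      format = 'PARQUET',
      uris = ['gs://myBucket/myTable/*'],
      hive_partition_uri_prefix = 'gs://myBucket/myTable'
    );

    -- Only objects under dt=2019-10-31/lang=en/ are read:
    SELECT COUNT(*)
    FROM mydataset.myTable
    WHERE dt = '2019-10-31' AND lang = 'en';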

Upvotes: 6

guillaume blaquiere

Reputation: 75735

BigQuery's external table capability blurs the border between a data lake (files) and a data warehouse (structured data), so your question is relevant.

When you use an external table, several features are missing, such as clustering and partitioning, and your files are parsed on the fly (with type casting): processing is slower and you can't control or limit the volume of data that your queries process. In addition, errors in the files can break your queries.

When you use a native table, the storage is optimized for BigQuery processing: the data is already cleaned and parsed, and the table can be partitioned and clustered.
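
As a rough sketch of that setup, with hypothetical dataset and column names, you could create the partitioned and clustered native table and load it straight from GCS with a LOAD DATA statement (one of several ways to load; the bq CLI works too):

    -- Hypothetical names: a native table, partitioned and clustered up front.
    CREATE TABLE IF NOT EXISTS mydataset.events (
      event_ts TIMESTAMP,
      lang     STRING,
      payload  STRING
    )
    PARTITION BY DATE(event_ts)
    CLUSTER BY lang;

    -- One-time load of the GCS files into the native table.
    LOAD DATA INTO mydataset.events
    FROM FILES (
      format = 'CSV',
      uris = ['gs://myBucket/events/*.csv']
    );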


The question of cost has several dimensions. First, data storage: if you have files in GCS and the same data in BigQuery, you pay for the storage twice. However, after 90 days without any update, a BigQuery table moves to long-term storage, which is about half the price. In addition, you can move your GCS files to a colder storage class after they have been loaded into BigQuery.

That's storage; now processing. First of all, processing typically costs roughly 10 times more than storage, and it's the most important thing to focus on. When you run a BigQuery query, you pay for the volume of data that the query scans. With partitioned or clustered native tables, you can limit the amount of data scanned and therefore greatly reduce the cost. With external tables, you can't use the partitioning and clustering features, so you always pay for scanning the full dataset.
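
For example, using the same hypothetical table as above, a filter on the partitioning column lets BigQuery prune partitions, so you are billed only for the bytes in that one partition; an external table would scan every file for the same query:

    -- Scans (and bills) only the 2019-10-31 partition of the native table;
    -- an external table would read the full dataset for this query.
    SELECT lang, COUNT(*) AS n
    FROM mydataset.events
    WHERE DATE(event_ts) = '2019-10-31'
    GROUP BY lang;

As a guardrail, a dry run reports the bytes a query would scan before you pay for it, and you can cap a query with a maximum-bytes-billed limit.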

Therefore, it depends (as always) on your data volume and how frequently you query it.


Don't forget one additional thing: with external tables, a malformed file can break your queries. In production, that can be dramatic. Think carefully about that.

Finally, querying an external table is slower than querying a native table (no partitioning, hence more data to process, plus parsing/casting time). Time is money, especially if you have time-critical queries, and that intangible cost can also influence your choice.

Upvotes: 7
