Peter Nagel
Peter Nagel

Reputation: 181

Clustering in BigQuery using CREATE TABLE

Unsure if I cluster correctly. Basicly I am looking at GCP Billing Info of say 50 clients. Each client has a Billing_ID and I cluster on that billing_ID. I use the clustered table for a data studio dashboard

See the the SQL query below to see what I do right now

CREATE OR REPLACE TABLE `dashboardgcp`
  PARTITION BY DATE(usage_start_time)
  CLUSTER BY billing_account_id
  AS
SELECT
  *
FROM
  `datagcp`
WHERE
 usage_start_time BETWEEN TIMESTAMP('2019-01-01')
  AND TIMESTAMP(CURRENT_DATE)

It is succesfully clustered like this, I am just not a noticeable query performance increase!

Upvotes: 1

Views: 2469

Answers (1)

Tamir Klein
Tamir Klein

Reputation: 3642

So I thought by clustering it with billing_ID I should see an increase in dashboard performance

Please consider the following points:

Cluster structure
A Cluster field is composed of an array of fields, like boxes, outer to inner, As state in BigQuery link

When you cluster a table using multiple columns, the order of columns you specify is important. The order of the specified columns determines the sort order of the data.

This means As @Gordon wrote, in your query the WHERE part needs to start from the outer field to the inner one to make the most out of your cluster field. In your case, if the userId is part of the WHERE you need to change your cluster field to match this

Cluster limitation
Cluster typically works better for query who scan over 1GB of data, So if you are not scanning this amount of data you won't see the improvement you are looking for

Cluster with Ingestion tables
Assuming your dara is not static and you keep adding data to your table, datagcp, you need to be aware that cluster indexing is a process which BigQuery perform off-line to the insert operation and a separate one to partitioning.
The side effect is that your clustering "weaken" over-time. To solve this you will need to use merge command to re-build your cluster in order to get the most out of your cluster

From the docs:

“Over time, as more and more operations modify a table, the degree to which the data is sorted begins to weaken, and the table becomes partially sorted”.

Upvotes: 1

Related Questions