BigQuery: join on clusters fields

Question

I'm trying to optimize our BigQuery model by using clustered tables.

I'm testing these scenarios:

Clustered table join clustered table (join by clusters fields)
Not Clustered table join Not clustered table (join by the same fields that scenario 1)

Without apply any where condition scenario 1 and 2 have equal cost(time and bytes processed). When I apply a condition by a clustered field 1 is 4x time faster and cheaper.

Clustered fields are only useful when you use a condition in the query? and not in a join? In this case, if I perform a join without any condition, the performance is the same with or without the clusters

How can I improve a join between two tables in BigQuery?

EDIT 2021-05-31

Add query execution plan of both jobs:

Miller · Accepted Answer

From docs, I would say that the cluster will be simply ignore as you are comparing using another column, during the Join.

Now, for optimizing a join, you could try to reduce data before the join. For instance, try filtering the tables, or pre-aggregating them to reduce as much data as you can. Finally, also take care of the order of the tables on the join. Order them from the biggest to the smallest one.

BigQuery: join on clusters fields

Answers (1)

Related Questions