Reputation: 463
I am trying to optimize a query that looks like the one below. Table A (tbla) is a big table, on the order of TBs. Table B (tblb) is a small table of a few MBs.
select dim, avg(fact)
from (
    select tbla.fact, tblb.dim
    from tbla
    left outer join tblb on tbla.key = tblb.key
    where 1=1
)
group by dim;
Executing the above query typically involves reading Table A and Table B, broadcasting Table B to each of the slots that hold a part of Table A, performing the join, and finally aggregating.
The question is the following: how can I help BQ execute the query in the most efficient manner? I define efficiency as a balanced execution, i.e. slots carrying similar workloads. The options I considered are:
(i) Run the query as-is, without partitioning Table A.
(ii) Partition Table A on "key": I was hoping that partitioning on the join key would somehow improve BQ performance. However, this option results in a "skewed" execution, because the distribution of keys is fairly uneven.
(iii) Create a balanced partition by partitioning Table A on another column. If BQ reads the table along that partition, the execution should be less skewed (see the DDL sketch after the question).
It turns out that approach (iii) does result in a more balanced execution, even though I cannot tell directly whether the partitioning was actually used. In light of the above, I was wondering if there is a "rule of thumb" for using partitions when optimizing BQ joins between a large and a small table.
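For reference, a minimal DDL sketch of what I mean by options (ii) and (iii). It assumes key is an INT64, and event_date is a hypothetical, evenly distributed DATE column; the dataset name and the bucket range are placeholders and would need to match the actual data.
-- Option (ii): integer-range partitioning on the join key (assumes key is INT64).
create table mydataset.tbla_by_key
partition by range_bucket(key, generate_array(0, 4000, 1))
as select * from mydataset.tbla;
-- Option (iii): partition on an evenly distributed column (hypothetical event_date),
-- optionally clustering on key so rows with the same key are co-located for the join.
create table mydataset.tbla_by_date
partition by event_date
cluster by key
as select * from mydataset.tbla;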
Upvotes: 0
Views: 763
Reputation: 1269583
You could always try pre-aggregating:
select b.dim, sum(sumf) / sum(cnt)
from (
    select key, sum(fact) as sumf, count(*) as cnt
    from a
    group by key
) a
join b using (key)
group by b.dim;
BQ might do a better job distributing the values by key on the larger table. Pre-aggregating also reduces the volume of data, so data skew won't make as much of a difference for the second aggregation.
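As a minimal sketch only, the same idea rewritten against the tables in the question, keeping the original left outer join and using count(fact) rather than count(*) so the result matches avg(fact) even when fact contains NULLs:
-- Pre-aggregate the TB-scale table to one row per key before the join.
select tblb.dim, sum(sumf) / sum(cnt) as avg_fact
from (
    select key, sum(fact) as sumf, count(fact) as cnt
    from tbla
    group by key
) a
left outer join tblb using (key)
group by tblb.dim;
The division reproduces the average because, within each dim group, sum(sumf) equals the total of fact and sum(cnt) equals the number of non-NULL fact values.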
Upvotes: 1