jsxgd
jsxgd

Reputation: 413

BigQuery MERGE statement billing more bytes than editor shows

I have a very large (3.5B records) table that I want to update/insert (upsert) using the MERGE statement in BigQuery. The source table is a staging table that contains only the new data, and I need to check if the record with a corresponding ID is in the target table, updating the row if so or inserting if not.

The target table is partitioned by an integer field called IdParent, and the matching is done on IdParent and another integer field called IdChild. My merge statement/script looks like this:

declare parentList array<int64>;

set parentList = array(select distinct IdParent from dataset.Staging);

merge into dataset.Target t
using dataset.Staging s
on
  -- target is partitioned by IdParent, do this for partition pruning
  t.IdParent in unnest(parentList)
  and t.IdParent = s.IdParent
  and t.IdChild = s.IdChild
when matched and t.IdParent in unnest(parentList) then
  update
    set t.Column1 = s.Column1,
    t.Column2 = s.Column2,
    ...<more columns>
when not matched and IdParent in unnest(parentList) then
  insert (<all the fields>)
  values (<all the fields)
;

So I:

The total size of dataset.Target is ~250GB. If I put this script in my BQ editor and remove all the IdParent in unnest(parentList) then it shows ~250GB to bill in the editor (as expected since there's no partition pruning). If I add the IdParent in unnest(parentList) back in so the script is exactly like you see it above i.e. attempting to partition prune, the editor shows ~97MB to bill. However, when I look at the query results, I see that it actually billed ~180GB:

Different bytes billed vs. editor's estimate

The target table is also clustered on the two fields being matched, and I'm aware that the benefits of clustering are typically not shown in the editor's estimate. However, my understanding is that that should only make the bytes billed smaller... I can't think of any reason why this would happen.

Is this a BQ bug, or am I just missing something? BigQuery doesn't even say "the script is estimated to process XX MB", it says "This will process XX MB" and then it processes way more.

Upvotes: 0

Views: 1180

Answers (1)

Hui Zheng
Hui Zheng

Reputation: 3097

That's very interesting. What you did seems totally correct.

It seems BQ query planner could interpret your SQL correctly and know the partition pruning is provided, but when it executes. it failed to do so.

try removing t.IdParent in unnest(parentList) from both when matched clauses to see if the issue still happens, that is,

declare parentList array<int64>;

set parentList = array(select distinct IdParent from dataset.Staging);

merge into dataset.Target t
using dataset.Staging s
on
  -- target is partitioned by IdParent, do this for partition pruning
  t.IdParent in unnest(parentList)
  and t.IdParent = s.IdParent
  and t.IdChild = s.IdChild
when matched then
  update
    set t.Column1 = s.Column1,
    t.Column2 = s.Column2,
    ...<more columns>
when not matched then
  insert (<all the fields>)
  values (<all the fields)
;

It would be a good idea to submit a bug to BigQuery if it couldn't be resolved.

Upvotes: 0

Related Questions