Ayush

Reputation: 85

BigQuery Loop - Performance Optimization

I have a query that fetches a few columns from a table based on some conditions and then performs an update. All of this runs in a loop: for each record returned by the SELECT, it issues an UPDATE, which is hurting performance badly.

for temp in (
  select a, b, c, d
  from tbl1
  where a in (
      select w2.a from tbl1 w2
      where cast(w2.c as int64) > 0 and trim(w2.d) = ''
    )
    and b in (
      select w2.b from tbl1 w2
      where cast(w2.c as int64) > 0 and trim(w2.d) = ''
    )
    and cast(c as int64) = 0
  order by a, b, c
)
do
  update tbl1 wrkup
  set wrkup.d = temp.d
  where wrkup.a = temp.a
    and wrkup.b = temp.b
    and cast(wrkup.c as int64) > 0
    and trim(wrkup.d) = '';
end for;

The script above took 1 hr 23 min to run 1,146 update statements, and tbl1 has 419,668 records in total. Note that I'm reading the same table tbl1 in the FOR loop and then updating it based on different conditions.

I did check this article: (https://medium.com/google-cloud/bigquery-looping-optimization-8d575e4b72d0#:~:text=Looping%20is%20a%20sequence%20of%20actions%20that%20is%20continually%20repeated,iteration%20will%20not%20get%20triggered) but the approach it describes seemed fairly complex to implement, so I'm wondering whether there's a simpler way to address this.

Upvotes: 0

Views: 139

Answers (2)

Mikhail Berlyant

Reputation: 173106

You have not provided any sample data, so the following is purely from reading your original query and is not tested at all.
That said, I believe your script is equivalent to the single UPDATE statement below:

update tbl1 wrkup
set wrkup.d = temp.d
from (
  select a, b, c, d
  from tbl1
  where
  a in
    (select w2.a from tbl1 w2
    where cast(w2.c as int64) > 0 and
    trim(w2.d) = ''
    )
  and
  b in
    (select w2.b from tbl1 w2
    where cast(w2.c as int64) > 0 and trim(w2.d) = ''
    )
  and cast(c as int64) = 0
  order by a, b, c
) as temp
where wrkup.a = temp.a
and wrkup.b = temp.b
and cast(wrkup.c as int64) > 0
and trim(wrkup.d) = '';
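The same logic can also be written as a MERGE statement, which some find easier to read. Again, this is an untested sketch built on the same unverified assumptions about your data:

```sql
-- Untested sketch: same filters as the UPDATE above, expressed as a MERGE.
merge tbl1 wrkup
using (
  select a, b, d
  from tbl1
  where a in (select w2.a from tbl1 w2
              where cast(w2.c as int64) > 0 and trim(w2.d) = '')
    and b in (select w2.b from tbl1 w2
              where cast(w2.c as int64) > 0 and trim(w2.d) = '')
    and cast(c as int64) = 0
) temp
on wrkup.a = temp.a
  and wrkup.b = temp.b
  and cast(wrkup.c as int64) > 0
  and trim(wrkup.d) = ''
when matched then
  update set d = temp.d;
```

One caveat: if more than one temp row can match a single target row, BigQuery will reject the MERGE (and the UPDATE ... FROM as well), so you may need to deduplicate inside the subquery first.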

Upvotes: 1

guillaume blaquiere

Reputation: 75940

The problem with your design is that the SQL script runs the queries sequentially. The way to optimize the processing is to parallelize it, and for that you have to do it externally.

Typically, you can use a script (Python, for instance) to run the "for" query. For each row returned (1,146 of them, if I understand correctly), you publish a Pub/Sub message or create a Cloud Task containing the row's data.

Then you plug in another processing unit, such as Cloud Run or Cloud Functions, that receives the message data and runs the "do" part in BigQuery. That way, all the requests run in parallel.

If something goes wrong, especially if you exceed a quota (which you may well reach at this volume), the query will fail but the message will be re-submitted. Either way, in the end you could reduce the duration by 10x or even 100x.

Upvotes: 1
