Porada Kev

Reputation: 513

BigQuery. Long execution time on small datasets

I created a new Google Cloud project and set up a BigQuery database. I tried different queries, and they all take too long to execute. We currently don't have a lot of data, so I expected high performance.

Below are some examples of queries and their execution time.

Query #1 (Job Id bquxjob_11022e81_172cd2d59ba):

select date(installtime) regtime
  ,count(distinct userclientid) users
  ,sum(fm.advcost) advspent
from DWH.DimUser du
join DWH.FactMarketingSpent fm on fm.date = date(du.installtime)
group by 1

The query failed after more than an hour with the error "Query exceeded resource limits. 14521.457814668494 CPU seconds were used, and this query must use less than 12800.0 CPU seconds."

Query execution plan: https://prnt.sc/t30bkz

Query #2 (Job Id bquxjob_41f963ae_172cd41083f):

select fd.date 
      ,sum(fd.revenue) adrevenue
      ,sum(fm.advcost) advspent
from DWH.FactAdRevenue fd
join DWH.FactMarketingSpent fm on fm.date = fd.date
group by 1

Execution took 59.3 sec, 7.7 MB processed, which is too slow.

Query Execution plan: https://prnt.sc/t309t4

Query #3 (Job Id bquxjob_3b19482d_172cd31f629):

select date(installtime) regtime
  ,count(distinct userclientid) users 
from DWH.DimUser du 
group by 1

Execution time was 5.0 sec elapsed, 42.3 MB processed, which is not terrible but should be faster for such small volumes of data.

Tables used:

DimUser - table size 870.71 MB, 2,771,379 rows
FactAdRevenue - table size 6.98 MB, 53,816 rows
FactMarketingSpent - table size 68.57 MB, 453,600 rows

The question is: what am I doing wrong that makes query execution time so long? If everything is fine, I would be glad to hear any advice on how to reduce execution time for such simple queries. If anyone from Google reads my question, I would appreciate it if the job IDs were checked.

Thank you!

P.S. I have previously used BigQuery on other projects, and the performance and execution time were incredibly good for tables 50+ TB in size.

Upvotes: 0

Views: 1568

Answers (1)

Rlaxx

Reputation: 26

Posting the same reply I gave in the GCP Slack workspace:

Both of your first two queries look like they have one particular worker that is overloaded. You can see this because, in the compute section, the max time is very different from the avg time. This could be for a number of reasons, but I can see that you are joining a table of 700k+ rows (looking at the 2nd input) to a table of ~50k rows (looking at the first input). This is not good practice; you should switch it so the larger table is the leftmost table. See https://cloud.google.com/bigquery/docs/best-practices-performance-compute?hl=en_US#optimize_your_join_patterns
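For example, Query #2 could be rewritten so that the larger FactMarketingSpent table comes first in the join. This is only a sketch based on the table sizes and column names you posted, so adjust it if your schema differs:

select fm.date
      ,sum(fd.revenue) adrevenue
      ,sum(fm.advcost) advspent
from DWH.FactMarketingSpent fm
join DWH.FactAdRevenue fd on fd.date = fm.date
group by 1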

You may also have a heavy skew in your join keys (e.g. 90% of rows are on 1/1/2020, or NULL). Check this.
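Something like this (a rough sketch against your FactMarketingSpent table; swap in whichever table and key the plan shows as skewed) will show whether a handful of dates or NULLs dominate the join key:

select date
      ,count(*) rows_per_key
from DWH.FactMarketingSpent
group by 1
order by rows_per_key desc
limit 20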

For the third query, that time is expected; try an approximate count instead to speed it up. Also note that BQ tends to get faster if you run the same query over and over, so this will get quicker.
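For instance, the distinct-user count in Query #3 could use APPROX_COUNT_DISTINCT, which trades a small amount of accuracy for speed (a sketch only, using the same table and columns as your query):

select date(installtime) regtime
  ,approx_count_distinct(userclientid) users
from DWH.DimUser du
group by 1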

Upvotes: 1
