BigQuery: why querying from temporary table takes 20+ times more time than individual aggregations over entire dataset?

Question

Table dataset.transactions contains ~5.000.000 records.

1) This query takes ~ 3 seconds:

SELECT *
  FROM dataset.transactions
  WHERE customer = 'c1' AND year = 2017

2) This query takes about 10 seconds:

SELECT
  salesrep_id AS id,
  date AS lastUsedForFplDate,
  fpl AS individual_fpl,
  ANY_VALUE(salesrep_name) AS salesrep_name,
  ANY_VALUE(customer) AS customer
FROM dataset.transactions VT1
  WHERE date = (
    SELECT
      MAX(date)
    FROM dataset.transactions VT2
    WHERE
      VT1.salesrep_id = VT2.salesrep_id
  )
GROUP BY
  salesrep_id,
  date,
  fpl

3) Whereas this one takes more than 200 seconds (tried few times, cancelled it each time after 200 seconds):

WITH transactions AS (
  SELECT *
  FROM dataset.transactions
  WHERE customer = 'c1' AND year = 2017
)

SELECT
  salesrep_id AS id,
  date AS lastUsedForFplDate,
  fpl AS individual_fpl,
  ANY_VALUE(salesrep_name) AS salesrep_name,
  ANY_VALUE(customer) AS customer
FROM transactions VT1
  WHERE date = (
    SELECT
      MAX(date)
    FROM transactions VT2
    WHERE
      VT1.salesrep_id = VT2.salesrep_id
  )
GROUP BY
  salesrep_id,
  date,
  fpl

Third query consist of the two queries above, except the #2 uses output of #1 as a source.

Why does it take so much time?

Xiaoxia Lin · Accepted Answer

Just not to leave the question unanswered. The third query uses self-joins which is an anti pattern as commented by Elliott.

The result of using a self-join is that it potentially doubles the number of output rows. This increase in output data can cause poor performance.

Instead of using a self-join, it's recommended to use a window (analytic) function to reduce the number of additional bytes that are generated by the query.

The simplified syntax of Analytic Functions is as below:

analytic_function_name ( [ argument_list ] )
  OVER (
    [ PARTITION BY partition_expression_list ]
    [ ORDER BY expression [{ ASC | DESC }] [, ...] ]
    [ window_frame_clause ]
  )

BigQuery: why querying from temporary table takes 20+ times more time than individual aggregations over entire dataset?

Answers (1)

Related Questions