Reputation: 494
I am attempting to set up a query that selects a subset of data from a range of daily partitions of Google Analytics session data and writes the data to a Google BigQuery staging table. The challenge for me is to reduce the processing cost when using a subquery in the WHERE clause.
Google Analytics data from the query are to be appended to a staging table before being processed and loaded into the target data table (my-data-table). The main query is given in two forms below. The first is hard-coded. The second reflects the preferred form. The upper bound on _TABLE_SUFFIX is hard-coded for both to simplify the query. The objective is to use MAX(date), where date has the form YYYYMMDD, from my-data-table as a lower bound on the ga_sessions_* daily partitions. The query has been simplified for presentation here but is believed to contain all necessary elements.
The aggregate query (SELECT MAX(date) FROM my-project-12345.dataset.my-data-table
) returns the value '20201015' and processes 202 KB. Depending upon whether I use the returned value explicitly (as '20201015') in the WHERE clause of the main query or use the SELECT MAX() query in the WHERE clause, there is a significant difference in data processed between the two queries (2.3 GB for the explicit value vs 138.1 GB for the SELECT MAX() expression).
Is there an optimization, plan, or directive that can be applied to the preferred form of the main query that will reduce the data processing cost? Thank you for any assistance that can be provided.
Main Query (hard-coded version, processes 2.3 GB)
SELECT
GA.date,
GA.field1,
hits.field2,
hits.field3
FROM
`my-project-12345.dataset.ga_sessions_*` AS GA, UNNEST(GA.hits) AS hits
WHERE
hits.type IN ('PAGE', 'EVENT')
AND hits.field0 = 'some value'
AND _TABLE_SUFFIX > '20201015'
AND _TABLE_SUFFIX < '20201025'
Main Query (preferred form, processes 138.1 GB without optimization)
SELECT
GA.date,
GA.field1,
hits.field2,
hits.field3
FROM
`my-project-12345.dataset.ga_sessions_*` AS GA, UNNEST(GA.hits) AS hits
WHERE
hits.type IN ('PAGE', 'EVENT')
AND hits.field0 = 'some value'
AND _TABLE_SUFFIX > (SELECT MAX(date) FROM `my-project-12345.dataset.my-data-table`)
AND _TABLE_SUFFIX < '20201025'
Upvotes: 1
Views: 1128
Reputation: 172993
You can use scripting for this
The "trick" is in pre-computing
DECLARE start_date STRING;
SET start_date = (SELECT MAX(date) FROM `my-project-12345.dataset.my-data-table`);
and assigning to variable and then use this variable in where clause on main query - in this case it will use cost effective version
AND _TABLE_SUFFIX > start_date
AND _TABLE_SUFFIX < '20201025'
Upvotes: 2