Reputation: 11
The _TABLE_SUFFIX feature is great and exactly what I was looking for to solve my problem. However, it scans all of the data matched by the wildcard when I use a sub-query to determine which tables to match on.
If you filter _TABLE_SUFFIX with an operation such as =, BETWEEN, or IN against a set of literal values, the amount of data scanned goes down compared to the bare wildcard:
SELECT sample_data FROM `test.dataset.*`
WHERE _TABLE_SUFFIX IN ("NWD1","NWD2","NWD3","NWD4","NWD5")
- 1.8 GB scanned
However if I do the following:
SELECT sample_data FROM `test.dataset.*`
WHERE _TABLE_SUFFIX IN (SELECT ID FROM subset)
- 50 GB scanned (the sub-select returns the same values as the explicit IN clause above)
Upvotes: 1
Views: 1365
Reputation: 4736
It is not about subqueries per se - as long as the limiting values are known inside the query itself, pruning still works, even when they are wrapped in subqueries:
CREATE TEMP FUNCTION daterange(suf STRING) AS (suf BETWEEN '04' AND '06');
WITH vars AS (
  SELECT
    '04' AS startDate,
    '06' AS endDate,
    ['04', '05', '06'] AS daterange
)
SELECT
  *
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_201611*` t -- 513.6 scanned with no suffix filter
WHERE
  -- each of the four filter variants below prunes the scan to 39.4
  _TABLE_SUFFIX BETWEEN '04' AND '06' -- 39.4
  -- (SELECT _TABLE_SUFFIX BETWEEN startDate AND endDate FROM vars) -- 39.4
  -- daterange(_TABLE_SUFFIX) -- 39.4
  -- _TABLE_SUFFIX IN UNNEST((SELECT daterange FROM vars)) -- 39.4
hth!
Upvotes: 1
Reputation: 59175
Constant filters on _TABLE_SUFFIX will reduce the amount of data queried, but not if those filters are coming from a dynamic subquery.
As an alternative - have you considered clustered storage?
Clustered tables are able to optimize the amount of data queried, even when filtering with a dynamic subquery.
SELECT sample_data FROM `test.dataset.*`
WHERE clustered_column IN (SELECT ID FROM subset)
would work.
For example
SELECT MAX(title), wiki
FROM `fh-bigquery.wikipedia_v3.pageviews_2018`
WHERE DATE(datehour) = '2018-01-10'
AND wiki IN (
  SELECT wiki
  FROM `fh-bigquery.wikipedia_v3.pageviews_2018`
  WHERE DATE(datehour) = '2018-01-01'
  AND NOT wiki LIKE 'e%'
  LIMIT 3
)
GROUP BY 2
queries 0.341 GB instead of 10 GB.
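If the data currently lives in date-sharded tables, one way to move to clustered storage (just a sketch: the consolidated table name and the reuse of the old shard suffix as the clustering column are assumptions, not something from the question) is to materialize everything into a single clustered table:
CREATE TABLE `test.dataset.all_data`
CLUSTER BY clustered_column AS
SELECT
  _TABLE_SUFFIX AS clustered_column, -- keep the old shard suffix so it can still be filtered on
  sample_data
FROM `test.dataset.*`;
After that, a dynamic filter such as WHERE clustered_column IN (SELECT ID FROM subset) can benefit from cluster pruning on the new table.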
Upvotes: 3
Reputation: 5503
Maybe the best you can do is to generate the IN clause with a query like the one below, and then build a second query using the generated IN clause:
select concat('IN ("', string_agg(ID, '", "'), '")') as in_clause
from subset;
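If you want to avoid the manual copy-paste between the two queries, the same idea can be automated with BigQuery scripting and EXECUTE IMMEDIATE - a sketch, assuming subset.ID holds the suffix values as strings and subset is resolvable in the default dataset:
DECLARE suffix_list STRING;

-- Build a constant, quoted list of suffixes from the subset table.
SET suffix_list = (
  SELECT STRING_AGG(FORMAT('"%s"', ID), ', ')
  FROM subset
);

-- The generated statement contains literal values, so the _TABLE_SUFFIX filter
-- should prune tables just like the explicit IN list in the question.
EXECUTE IMMEDIATE FORMAT("""
  SELECT sample_data
  FROM `test.dataset.*`
  WHERE _TABLE_SUFFIX IN (%s)
""", suffix_list);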
Upvotes: 1