Alexander Baumann
Alexander Baumann

Reputation: 11

Wildcard table matches with _TABLE_SUFFIX and sub-query

The _TABLE_SUFFIX feature is great and exactly what I was looking for to solve my problem - however it is scanning all of the data matched by the wildcard when I use a sub-query to determine which tables to match on.

If you do an operation such as = or BETWEEN or IN with a set of values on _TABLE_SUFFIX, you can see the amount of data being scanned goes down compared to simply a wildcard:

SELECT sample_data FROM `test.dataset.*` WHERE _TABLE_SUFFIX IN ("NWD1","NWD2","NWD3","NWD4","NWD5") - 1.8 GB scanned

However if I do the following:

SELECT sample_data FROM `test.dataset.*` WHERE _TABLE_SUFFIX IN (SELECT ID FROM subset) - 50GB scanned (this sub-select contains the same values as shown in the explicit IN clause)

Upvotes: 1

Views: 1365

Answers (3)

Martin Weitzmann
Martin Weitzmann

Reputation: 4736

It is not about the subqueries per se - if the limiting data sits in the query it is all fine - even when using subqueries:

CREATE TEMP FUNCTION daterange(suf string) as ( suf between '04' and '06');

WITH vars AS (
  SELECT
    '04' as startDate,
    '06' as endDate,
    ['04', '05', '06'] as daterange
  )

SELECT 
  *
FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_201611*` t -- 513.6
WHERE 
  _TABLE_SUFFIX between '04' and '06' -- 39.4
  --(select _table_suffix between startDate and endDate from vars) -- 39.4
  --daterange(_table_suffix) -- 39.4
  --_table_suffix in unnest( (select daterange from vars) ) -- 39.4

hth!

Upvotes: 1

Felipe Hoffa
Felipe Hoffa

Reputation: 59175

Constant filters on _TABLE_SUFFIX will reduce the amount of data queried, but not if those filters are coming from a dynamic subquery.

As an alternative - have you considered clustered storage?

Clustered tables are able to optimize the amount of data queried, even when filtering with a dynamic subquery.

SELECT sample_data FROM `test.dataset.*`
  WHERE clustered_column IN (SELECT ID FROM subset)

would work.

For example

SELECT MAX(title), wiki
FROM `fh-bigquery.wikipedia_v3.pageviews_2018` 
WHERE DATE(datehour) = '2018-01-10'
AND wiki IN (
  SELECT wiki 
  FROM  `fh-bigquery.wikipedia_v3.pageviews_2018`   
  WHERE  DATE(datehour) = '2018-01-01'
  AND NOT wiki LIKE 'e%'
  LIMIT 3
)
GROUP BY 2 

queries 0.341 GB, instead of 10 GB

Upvotes: 3

Yun Zhang
Yun Zhang

Reputation: 5503

Maybe the best you can do is to generate IN clause with a query like below then form another query with generated IN clause:

select concat('IN ("', string_agg(ID, '", "'), '")') as in_clause 
from subset;

Upvotes: 1

Related Questions