Jeremy
Jeremy

Reputation: 325

BigQuery Query Latest Table Efficiently

I'm trying to efficiently query from the latest table in a dataset that consists of tables of the form project_id:dataset:dataset_20160101, project_id:dataset:dataset_20160102 etc.

This query seems to be the recommended solution:

SELECT * 
FROM `project_id.dataset.*`
WHERE _TABLE_SUFFIX=(SELECT MAX(table_id) FROM `project_id.dataset.__TABLES_SUMMARY__`)

However, this query bills me for accessing all tables in the dataset, not just the latest one. Why is that?

Upvotes: 4

Views: 280

Answers (2)

xuejian
xuejian

Reputation: 195

For now, if you have a pipeline which inserts tables, then at the end of the pipeline, you could (1) get the MAX(table_id) (2) create a view with following query:

SELECT
        *,
        DATE max_table_id AS _LATEST_DATE,
        PARSE_DATE('%E4Y%m%d', _TABLE_SUFFIX) AS _DATA_DATE
FROM `project_id.dataset.*`

Then whenever you want to query the data, you could just query against the view, and specify _DATA_DATE=_LATEST_DATE as the filter. In this case, _LATEST_DATE is pre-calculated, so it's constant. As Hua explained, you will be billed for processing only the latest table.

Upvotes: 0

Hua Zhang
Hua Zhang

Reputation: 1551

We can only prune the tables before the query runs when the WHERE clause uses a constant expression on the pseudo column, e.g., _TABLE_SUFFIX = 'dataset_20160102'. For your query, as the WHERE clause includes a sub-uery which doesn't parse to a constant, we cannot prune the tables before the query runs. Instead, data is read from all tables and the sub-query is executed. Then data is joined with the sub-query results and filtered.

It's possible to prune the tables during the query execution. Start the query, execute the sub-query, prune the tables, and read data. But there's no ETA for it yet.

Upvotes: 0

Related Questions