Dominic Woodman
Dominic Woodman

Reputation: 839

Pruning BigQuery partitions with Data studio

I've got an almost identical scenario to this question:

How to choose the latest partition in BigQuery table?

With one additional complication. I need the result to display in Data Studio.

The setup

I've got a series of data sets which appear at different time intervals apart and I need to get the most recent partition. Because they're inconsistent time periods apart I can't just get the last day and use that.

I can use BigQuery scripting to successfully prune the queries with a dynamic query, but when I move this query into Data Studio the query doesn't load properly.

The table loads in the data sources part:

enter image description here

But when I actually try to use it in the report:

Data Studio cannot connect to your data set.

Failed to fetch data from the underlying data set

Error ID: e6546a97

Is there a way to get Data Studio to display this properly with pruning?

Example query

DECLARE max_date DATE;
SET max_date = (SELECT DATE(MAX(_partitiontime)) FROM `dataset.table`);   

SELECT *
FROM `dataset.table`
WHERE DATE(_partitiontime ) = max_date

Upvotes: 4

Views: 1150

Answers (2)

Cylldby
Cylldby

Reputation: 1978

You have access to all your partitions information in the project.dataset.INFORMATION_SCHEMA.PARTITIONS table. Therefore you can try something like:

SELECT *
FROM `dataset.table`
WHERE DATE(_PARTITION_TIME) = (
    SELECT 
        MAX(SAFE.PARSE_DATE('%Y%m%d',partition_id))
    FROM `dataset.INFORMATION_SCHEMA.PARTITIONS`
    WHERE TABLE_NAME = "table"
)

Make sure you understand that pruning works with the above query:

enter image description here

Upvotes: 1

Cylldby
Cylldby

Reputation: 1978

Workaround:

A possibility is to use date parameters and make a query like the following:

SELECT *
FROM `dataset.table`
WHERE DATE(_PARTITION_TIME)>= PARSE_DATE("%Y%m%d", @DS_START_DATE)

This does not precisely the answer, but with a date range defaulted to "yesterday to today" data you effectively prune your table to only recent partitions. In the case that the data is irregular as you mention it, users still have the possibility to manually extend the date range until they find the data.

In parallel, you can also add the following custom query to your data sources:

SELECT 
    MAX(SAFE.PARSE_DATE('%Y%m%d',partition_id)) AS latest_available_partition
FROM `dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE TABLE_NAME = "table"

and display it into a table to provide the information to users.

Indeed, this workaround implies that you trust your end users not to play too much with the date range.

Upvotes: 2

Related Questions