Reputation: 29
I am currently working on an ETL pipeline that uses BigQuery to store staging data, and then uses Dataprep to transform the data and store it in new BigQuery tables for production.
We have been struggling to find the most cost-effective way to apply these transforms to only a small selection of the data, typically just the last X days counted from the current max date in the staging table. For example, we need to calculate the max available date in the staging data and then retrieve all rows within the past 3 days of that date. Unfortunately we can't rely on the 'max date' in the staging data always being up to date (this data is brought in from third-party APIs of varying quality and reliability).
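In plain BigQuery standard SQL, the requirement looks roughly like the sketch below (the event_date column is a hypothetical stand-in for whatever date field the staging data actually carries):

-- Keep only rows within 3 days of the latest date actually present in staging.
SELECT {columns}
FROM `{project}.{dataset}.{table}`
WHERE event_date >= DATE_SUB(
  (SELECT MAX(event_date) FROM `{project}.{dataset}.{table}`),
  INTERVAL 3 DAY)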
At first I tried applying these transforms directly in Dataprep by getting the max date, creating a comparison column using DATEDIFF, and then discarding rows more than 3 days older than this 'max date'. This proved to be very time-consuming and cost-inefficient.
The next thing we tried was to filter down the data in BigQuery views, which would then be used as the initial datasets for the Dataprep flows (the data would be pre-filtered before Dataprep applies any transforms). We first tried doing this dynamically in BigQuery, like so:
WITH latest_partitiontime AS (
  SELECT _PARTITIONTIME AS pt
  FROM `{project}.{dataset}.{table}`
  GROUP BY _PARTITIONTIME
  ORDER BY _PARTITIONTIME DESC
  LIMIT 1
)
SELECT {columns}
FROM `{project}.{dataset}.{table}`
WHERE _PARTITIONTIME >= (SELECT pt FROM latest_partitiontime)
But previewing the estimated GB processed / cost of this query shows it to be very inefficient and expensive.
The next thing we tried was hard-coding the date, which for some reason is a lot cheaper and quicker:
SELECT {columns}
FROM `{project}.{dataset}.{table}`
WHERE _PARTITIONTIME >= '2018-08-08'
So our current plan is to maintain a view for each table and update the hard-coded date in the view SQL via the Python SDK each time the staging load completes successfully (https://cloud.google.com/bigquery/docs/managing-views).
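For illustration, each regenerated view would end up looking something like this (the view name is a placeholder, shown here as the equivalent DDL; the Python SDK call just overwrites the view's SQL with a new literal date):

CREATE OR REPLACE VIEW `{project}.{dataset}.recent_staging_view` AS
SELECT {columns}
FROM `{project}.{dataset}.{table}`
-- This literal gets rewritten by the pipeline after each successful staging load.
WHERE _PARTITIONTIME >= TIMESTAMP('2018-08-08')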
It feels like we are potentially missing a much easier/more efficient solution to this problem. So I wanted to ask: is there a simpler or more cost-effective way to filter the staging data down to only the most recent days before transforming it?
Upvotes: 2
Views: 2862
Reputation: 208022
Are you familiar with the MERGE statement in standard SQL and the recently released clustering feature? MERGE can combine your data, and you can further customize it to read only some partitions.

Example from the manual:
MERGE dataset.DetailedInventory T
USING dataset.Inventory S
ON T.product = S.product
WHEN NOT MATCHED AND quantity < 20 THEN
  INSERT(product, quantity, supply_constrained, comments)
  VALUES(product, quantity, true, ARRAY<STRUCT<created DATE, comment STRING>>[(DATE('2016-01-01'), 'comment1')])
WHEN NOT MATCHED THEN
  INSERT(product, quantity, supply_constrained)
  VALUES(product, quantity, false)
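As a rough sketch of that customization (the table, key, and column names below are made up), you could pre-filter the source in a USING subquery and put a constant filter on the target's partition column in the merge condition, so that only the recent partitions are touched:

MERGE `{project}.{dataset}.production_table` T
USING (
  -- Only pull recent rows out of the ingestion-time partitioned staging table.
  SELECT id, value, event_date
  FROM `{project}.{dataset}.staging_table`
  WHERE _PARTITIONTIME >= TIMESTAMP('2018-08-08')
) S
ON T.id = S.id
  -- Constant filter on the target's (date) partition column limits what is read on that side.
  AND T.event_date >= DATE('2018-08-08')
WHEN MATCHED THEN
  UPDATE SET value = S.value
WHEN NOT MATCHED THEN
  INSERT(id, value, event_date) VALUES(S.id, S.value, S.event_date)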
Hint: you can partition the table by a date column that is always NULL, so every row falls into a single partition, and rely only on the clustering level for pruning.
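A hypothetical sketch of that trick (all names are placeholders): create the production table partitioned by a DATE column that is always NULL and clustered by the date you actually filter on, so clustering alone limits what a query reads:

CREATE TABLE `{project}.{dataset}.production_clustered`
PARTITION BY fake_partition_date
CLUSTER BY event_date AS
SELECT
  CAST(NULL AS DATE) AS fake_partition_date,  -- always NULL: every row lands in the single NULL partition
  event_date,                                 -- the date the queries actually filter on
  id,
  value
FROM `{project}.{dataset}.staging_table`;

-- Queries filtering on the clustered column then read only the relevant blocks:
SELECT id, value
FROM `{project}.{dataset}.production_clustered`
WHERE event_date >= DATE '2018-08-08'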
Upvotes: 3