Reputation: 1
We are building a data warehouse in Google BigQuery, using Apache Airflow to transfer MySQL tables to the cloud. I have daily transfers up and running that fetch data incrementally using
MySqlToGoogleCloudStorageOperator => GoogleCloudStorageToBigQueryOperator
with that bit in .sql:
WHERE timestamp BETWEEN '{{ macros.ds_add(ds, -1) }}' AND '{{ ds }}'
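For reference, `macros.ds_add` just shifts the `YYYY-MM-DD` date string by a number of days. A stdlib sketch of how that WHERE clause renders (the `ds_add` stand-in here is mine, not Airflow's actual implementation):

```python
from datetime import datetime, timedelta

def ds_add(ds, days):
    # stdlib stand-in for Airflow's macros.ds_add
    d = datetime.strptime(ds, "%Y-%m-%d") + timedelta(days=days)
    return d.strftime("%Y-%m-%d")

ds = "2019-03-10"  # the templated execution date
where = "WHERE timestamp BETWEEN '{}' AND '{}'".format(ds_add(ds, -1), ds)
print(where)  # WHERE timestamp BETWEEN '2019-03-09' AND '2019-03-10'
```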
It's working perfectly, but we have one table that we would like to update MONTHLY, on the 10th of each month, fetching data for the whole previous month. So when run with an execution date of 2019-03-10 we want to fetch
timestamp BETWEEN '2019-02-01 00:00:00' AND '2019-03-01 00:00:00'
and so on. What's the best and simplest way to do that?
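The window I'm after can be computed with plain stdlib date arithmetic, independent of Airflow (a sketch; the helper name is mine):

```python
from datetime import date, timedelta

def monthly_window(execution_date):
    """Given the run date (e.g. the 10th), return the first day of the
    previous month and the first day of the current month."""
    first_this = execution_date.replace(day=1)
    # stepping back one day from the 1st always lands in the previous month
    first_prev = (first_this - timedelta(days=1)).replace(day=1)
    return first_prev, first_this

start, end = monthly_window(date(2019, 3, 10))
print(start, end)  # 2019-02-01 2019-03-01
```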
Upvotes: 0
Views: 1487
Reputation: 724
Could you not simply derive both boundaries from the execution date? Note that `ds` is just a string, so it has no date methods; use `execution_date`, which is a datetime object. -> first day of last month
{{ (execution_date.replace(day=1) - macros.timedelta(days=1)).replace(day=1) }}
-> first day of the current month
{{ execution_date.replace(day=1) }}
Upvotes: 0
Reputation: 5518
#standardSQL
SELECT
  CURRENT_DATE() AS today,
  DATE_TRUNC(CURRENT_DATE(), MONTH) AS first_day_this_month,
  DATE_SUB(DATE_TRUNC(CURRENT_DATE(), MONTH), INTERVAL 1 MONTH) AS first_day_last_month;
Upvotes: 0
Reputation: 1651
Have you considered BigQuery partitioned tables?
You can create monthly partitions (which are dynamic in nature) and patch only the data for the month you wish to mutate in BigQuery.
All of this can be orchestrated very neatly with a coupling between Cloud Composer and Cloud Functions.
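Assuming a partitioned table whose partitions are addressed with a decorator like `dataset.table$201902` (table name hypothetical), the suffix for the previous month could be derived like this (a sketch; the helper name is mine):

```python
from datetime import date, timedelta

def prev_month_partition(execution_date):
    """YYYYMM suffix of the month before execution_date, e.g. for a
    load destination like dataset.table$201902 (table name hypothetical)."""
    last_of_prev = execution_date.replace(day=1) - timedelta(days=1)
    return last_of_prev.strftime("%Y%m")

print(prev_month_partition(date(2019, 3, 10)))  # 201902
```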
Upvotes: 0