Reputation: 3
I have an incremental DBT model with insert_override scheme. Model runs hourly through Airflow DAG incrementally by deleting and updating past 2 days of data. In case of some source schema change or missing data in derived table, an ad-hoc DAG is triggered to run the model with --full-refresh flag.
Full refresh seems to be the only option in case of schema change, however, in case when there is just one day's data missing in derived table, full refresh seems costly.
Every time that I want to do full-refresh, I change Airflow variable for ad-hoc DAG as follows and then trigger the DAG:
{
"_comment": "NOTE: after changing values here, refresh the DAG from the Airflow UI.",
"dbt_parser_list": [
"--select", "model_name", "--full-refresh"
],
"dbt_commands": [
"run"
]
}
I want to be able to run ad-hoc DAG such that, instead of full-refresh, I could pass a date so that only the data after that date is refreshed. This is useful in my case as I am generally only concerned about current month's data. So, if data for 10-Apr-2024 is not available in derived tables, I want to be able to pass 01-Apr-2014 and get refreshed data post 01-Apr-2024. Otherwise, if I want to do full-refresh, I can pass 01-01-2020 which is default start date for full refresh.
Is there a way to do it?
I'm new to DBT. I could find that there's -vars flag for code generalization but not sure how to use it.
Upvotes: 0
Views: 814
Reputation: 2089
You should be able to use a project-defined variable, and play around with the --vars
flag in your dbt commands.
For instance, let's say that you add a variable in your dbt_project.yml
file like the following:
# dbt_project.yml
vars:
incremental_cutoff_date: '(select max(<your_model_timestamp>) from {{ this }})'
Now, currently, your incremental model's is_incremental
bit probably looks like the following:
-- your_incremental_model.sql
(...)
{% if is_incremental() %}
where <your_model_timestamp> > (select max(<your_model_timestamp>) from {{ this }})
{% endif %}
You can adjust it, to use the new-defined variable instead:
{% if is_incremental() %}
where <your_model_timestamp> > {{ var('incremental_cutoff_date') }}
{% endif %}
So, for your new dbt job to work, you only need to pass the preferred date value in your command:
dbt run --select model_name --vars "incremental_cutoff_date: '''2024-04-01'''"
(there's probably a more elegant way to escape the single quotes)
The variable specified in your command will always prevail over the one defined in your dbt_project.yml.
So, to finish up, when you don't set any --vars flag in your commands, the is_incremental
SQL bit of your model should be compiled to:
where <your_timestamp> > (select max(<your_timestamp>) from database.schema.model)
And when a variable is defined in your commands:
dbt run --select model_name --vars "incremental_cutoff_date: '''2024-04-01'''"
the is_incremental
SQL bit of your model should be compiled to:
where <your_timestamp> > '2024-04-01'
Upvotes: 0