Reputation: 91
I am using dbt (getdbt) on Redshift for data analytics. Can anyone explain how to use --selector and --defer with the "dbt run" command? What is the syntax? What is the selectors.yml file used for? Please share some examples.
Thanks
Upvotes: 0
Views: 1603
Reputation: 2763
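My interpretation of defer is that it lets the dbt CLI run a subset of models against a previously built state of the project. You pass --defer together with --state, pointing at the artifacts (manifest.json) of an earlier run, for example: dbt run --models my_model --defer --state path/to/artifacts. Any ref() to a model that is not in the current selection (and, in recent dbt versions, has not been built in your current target) then resolves to the relation recorded in that manifest instead of one in your own schema.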
An example of why you might want this: #2740 - Automating Non Regression Test
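To make the idea concrete, here is a toy Python model of how ref() resolution changes under defer. It is a simplified sketch, not dbt's implementation: it ignores the "already exists in the current target" check, and every name in it (resolve_ref, dbt_dev, analytics, the model names) is hypothetical.

```python
from typing import Dict, Set


def resolve_ref(model: str, selected: Set[str], dev_schema: str,
                state_manifest: Dict[str, str], defer: bool) -> str:
    """Return the relation a ref() to `model` points at in this toy model."""
    if defer and model not in selected and model in state_manifest:
        # Unselected model: reuse the relation recorded by the earlier run.
        return state_manifest[model]
    # Selected model, or defer switched off: use the current target schema.
    return f"{dev_schema}.{model}"


# Hypothetical artifacts from an earlier (e.g. production) run.
state_manifest = {
    "stg_orders": "analytics.stg_orders",
    "fct_orders": "analytics.fct_orders",
}
selected = {"fct_orders"}  # the only model being built in this run

print(resolve_ref("stg_orders", selected, "dbt_dev", state_manifest, defer=True))
print(resolve_ref("fct_orders", selected, "dbt_dev", state_manifest, defer=True))
print(resolve_ref("stg_orders", selected, "dbt_dev", state_manifest, defer=False))
```

With defer on, the upstream model that was not selected is read from the earlier run's schema, so it does not have to be rebuilt in the development schema first.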
Selectors are a relatively new feature and I haven't seen much documentation to back this up, but a selector is effectively a named set of logical selection criteria (more than one tag, multiple directories, etc.) that you define once in selectors.yml and then refer to by name.
I'd recommend this article in general for understanding the build path generation of a typical dbt run: How we made dbt runs 30% faster
From there, you can imagine that within a large project, there are huge interconnecting chains for each raw -> analytics ready transformation pipeline that you have.
We'll use Gitlab's open dbt project as an example.
Gitlab doesn't currently use selectors but they do make use of tags.
So they could build up a selectors.yml file using logical definitions like:
selectors.yml
selectors:
  - name: sales_funnel
    definition:
      # union = models carrying either tag; use intersection to require both
      union:
        - method: tag
          value: salesforce
        - method: tag
          value: sales_funnel
  - name: arr
    description: builds all arr models to current state + all upstream dependencies (zoho, zuora subscriptions, etc.)
    default: true
    definition:
      union:
        - method: tag
          value: zuora_revenue
        - method: tag
          value: arr
  - name: month_end_process
    description: builds reporting models about customer segments based on subscription activity for latest closed month
    definition:
      union:
        - method: fqn
          value: rpt_available_to_renew_month_end
          indirect_selection: eager  # default: includes all tests that touch the selected model (dbt 1.0+ key; older releases used greedy)
        - method: fqn
          value: rpt_possible_to_churn_month_end
          indirect_selection: eager
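As a rough illustration of what a union or intersection definition means, the Python sketch below evaluates a selector-style definition against a made-up project. It only mimics the set logic for the tag method and is not how dbt parses selectors.yml; the model names and the evaluate helper are hypothetical.

```python
from typing import Any, Dict, Set

# Hypothetical project: model name -> tags attached to it.
MODEL_TAGS: Dict[str, Set[str]] = {
    "sfdc_opportunity": {"salesforce"},
    "funnel_stages": {"sales_funnel"},
    "sfdc_funnel": {"salesforce", "sales_funnel"},
    "arr_monthly": {"arr"},
}


def evaluate(definition: Dict[str, Any]) -> Set[str]:
    """Evaluate a union / intersection / tag definition to a set of models."""
    if "union" in definition:
        result: Set[str] = set()
        for part in definition["union"]:
            result |= evaluate(part)  # models matching any part
        return result
    if "intersection" in definition:
        parts = [evaluate(part) for part in definition["intersection"]]
        return set.intersection(*parts) if parts else set()  # models matching all parts
    if definition.get("method") == "tag":
        return {m for m, tags in MODEL_TAGS.items() if definition["value"] in tags}
    raise ValueError(f"unsupported definition: {definition}")


tags = [
    {"method": "tag", "value": "salesforce"},
    {"method": "tag", "value": "sales_funnel"},
]
print(sorted(evaluate({"union": tags})))         # models with either tag
print(sorted(evaluate({"intersection": tags})))  # models with both tags
```

The choice between union and intersection is the main thing to get right when a selector combines several tags: the first widens the run, the second narrows it.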
Full list of valid selector definitions here: https://docs.getdbt.com/reference/node-selection/yaml-selectors#default
That lets them simply execute the following from a cron job, Airflow, or some other orchestrator:
dbt run --selector month_end_process --full-refresh
And be confident that the logical selection of models for that process is reproduced exactly, instead of relying on a more fallible approach such as assuming that all the models needed are in a single directory:
dbt run --models marts.finance.restricted_safe.reports --full-refresh
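If the orchestrator is Python-based (Airflow, for instance), the selector-based invocation could be assembled by a small helper like the sketch below. The function name and its parameters are hypothetical; the flags it emits (--selector, --defer, --state, --full-refresh) are the dbt CLI flags discussed above, and the helper enforces that --defer is only used together with --state.

```python
from typing import List, Optional


def build_dbt_run_args(selector: Optional[str] = None,
                       state_dir: Optional[str] = None,
                       defer: bool = False,
                       full_refresh: bool = False) -> List[str]:
    """Assemble the argument list for a `dbt run` invocation."""
    if defer and state_dir is None:
        # dbt needs the earlier run's artifacts to know where to defer to.
        raise ValueError("defer requires state_dir (the --state artifacts path)")
    args = ["dbt", "run"]
    if selector:
        args += ["--selector", selector]
    if defer:
        args.append("--defer")
    if state_dir:
        args += ["--state", state_dir]
    if full_refresh:
        args.append("--full-refresh")
    return args


# Same command as the month-end example above.
print(" ".join(build_dbt_run_args(selector="month_end_process", full_refresh=True)))

# Selector plus defer; "prod-artifacts" is a made-up directory name.
print(" ".join(build_dbt_run_args(selector="arr", defer=True, state_dir="prod-artifacts")))
```

The resulting list can be handed to subprocess.run(args, check=True) or to whatever shell operator the orchestrator provides.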
Architecturally, you likely won't need selectors until you get to the level of having multiple layers of tags and / or multiple layers of use-case directories to be mindful of within a single run.
Example: tags for the models' function, tags for the sources, tags for the bi/analyst consumers, tags for the materialization schedule, etc.
Upvotes: 1