Reputation: 23099
I have a data warehouse that ingests data from multiple databases. These land in the same database in my warehouse but are logically split into different schemas.
Now, if I have the following model:
#my_cool_model.sql
SELECT a.ID, b.DATE
FROM {{ source(var('environment'), 'table_a') }} a
INNER JOIN {{ source(var('environment'), 'table_b') }} b
and a sources.yaml that refers to my database sources:
#sources.yaml
version: 2
sources:
  - name: staging
    database: staging
    schema: sales
    tables:
      - name: table_a
    # what I've tried:
    schema: finance
    tables:
      - name: table_b
This is a simple abstraction, but it illustrates my point.
When I try to compile or run my model with
dbt compile -m my_cool_model --vars "{'environment': 'staging'}"
I get an error stating that this model relies on a model where table_a does not exist, which I assume means my table_a reference is being overwritten.
Before I go down the path of refactoring how my data is stored in the warehouse: is there a nice workaround, or is this a limitation of dbt?
Upvotes: 3
Views: 4721
Reputation: 10991
Since these are two different schemas, they should be treated as two different sources. The source_name you pass into source() should not be driving the environment. You can achieve that either by using Jinja within the yml or by overriding source() with your own macro (which I wouldn't recommend in this case).
Example:
version: 2
sources:
  - name: sales
    database: "{{ env_var('environment') }}"
    schema: sales  # Not needed since it matches the source name
    tables:
      - name: table_a
  - name: finance
    database: "{{ env_var('environment') }}"
    schema: finance
    tables:
      - name: table_b
#my_cool_model.sql
SELECT a.ID, b.DATE
FROM {{ source('sales', 'table_a') }} a
INNER JOIN {{ source('finance', 'table_b') }} b
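With env_var() driving the database, you set an OS environment variable instead of passing --vars on the command line (this assumes the variable is literally named environment, matching the yml above):

```shell
# Set the variable that env_var('environment') in sources.yml resolves;
# no --vars flag is needed for source selection anymore.
export environment=staging
dbt compile -m my_cool_model
```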
Also, just one comment on best practices: I would recommend splitting the sources into their own ymls and creating base models for each, then ref()-ing those models in my_cool_model.sql. Your base models should know the specific configurations of where sources come from; then your downstream models don't need to worry about it.
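A minimal sketch of that layering (file and model names here are illustrative, not prescribed):

```sql
-- models/staging/base_sales__table_a.sql
-- Only the base models call source(); they isolate source config.
SELECT ID
FROM {{ source('sales', 'table_a') }}
```

```sql
-- models/staging/base_finance__table_b.sql
SELECT ID, DATE
FROM {{ source('finance', 'table_b') }}
```

```sql
-- models/marts/my_cool_model.sql
-- Downstream models only ref() other models; no source() calls here.
SELECT a.ID, b.DATE
FROM {{ ref('base_sales__table_a') }} a
INNER JOIN {{ ref('base_finance__table_b') }} b
```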
EDIT to elaborate on "best practices": It's a best practice because you don't want your transformation-stage models to have to know where the data originated; they just need to know the data is available to them. dbt provides its own recommendations here. Personally, I do things a little differently:
Upvotes: 3