Umar.H

Reputation: 23099

using a dbt source with references to multiple schemas

I have a data warehouse that ingests data from multiple databases. These land in the same database in my warehouse but are logically split into different schemas.

Now, if I have the following model:

#my_cool_model.sql
SELECT a.ID, b.DATE
FROM {{ source(var('environment'), 'table_a') }} a
INNER JOIN {{ source(var('environment'), 'table_b') }} b
    ON a.ID = b.ID  -- illustrative join key

and a sources.yaml that refers to my database sources:

#sources.yaml
version: 2

sources:

  - name: staging
    database: staging
    schema: sales
    tables:
      - name: table_a

    # what I've tried.
    schema: finance
    tables:
      - name: table_b

This is a simple abstraction, but it illustrates my point.

When I try to compile or run my model:

dbt compile -m my_cool_model --vars "{'environment' : 'staging'}" 

I get an error stating that this model relies on a source where table_a does not exist, which I assume means my table_a reference is being overwritten.

Before I go down the path of refactoring how my data is stored in the warehouse: is there a nice workaround, or is this a limitation of dbt?

Upvotes: 3

Views: 4721

Answers (1)

Adam Kipnis

Reputation: 10991

Since these are two different schemas, they should be treated as two different sources. The source_name you pass into source() should not be driving the environment. You can do that by either using Jinja within the yml or by overriding source() with your own macro (which I wouldn't recommend in this case).

Example:

version: 2

sources:
  - name: sales
    database: "{{ env_var('environment') }}"
    schema: sales # Not needed since it matches the source name
    tables:
      - name: table_a

  - name: finance
    database: "{{ env_var('environment') }}"
    schema: finance
    tables:
      - name: table_b

#my_cool_model.sql
SELECT a.ID, b.DATE
FROM {{ source('sales', 'table_a') }} a 
INNER JOIN {{ source('finance', 'table_b') }} b
    ON a.ID = b.ID  -- illustrative join key
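
For reference, the macro-override route mentioned above might look something like the sketch below. This is a hypothetical illustration rather than a recommendation: dbt exposes the built-in implementation as builtins.source, so a project-level source() macro can resolve an environment into the source name before delegating. The var('environment', 'staging') default and the '<environment>_<source>' naming scheme (i.e., sources defined as staging_sales, staging_finance, etc.) are assumptions for this example.

-- macros/source.sql (hypothetical override that shadows dbt's built-in source())
{% macro source(source_name, table_name) %}
    {# Resolve 'sales' to e.g. 'staging_sales', then delegate to the built-in. #}
    {{ return(builtins.source(var('environment', 'staging') ~ '_' ~ source_name, table_name)) }}
{% endmacro %}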

Also, just one comment on best practices: I would recommend splitting the sources into their own ymls and creating base models for each, then ref() those models in my_cool_model.sql. Your base models should know the specific configurations of where sources come from, so that your downstream models don't need to worry about it.
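
To sketch what that might look like (the file and model names below, such as base_sales_table_a, are made up for illustration):

-- models/base/base_sales_table_a.sql
-- One model, one source: the only place that knows where table_a comes from.
SELECT
    ID,
    DATE
FROM {{ source('sales', 'table_a') }}

With an analogous base_finance_table_b, my_cool_model.sql no longer touches source() at all:

-- my_cool_model.sql
SELECT a.ID, b.DATE
FROM {{ ref('base_sales_table_a') }} a
INNER JOIN {{ ref('base_finance_table_b') }} b
    ON a.ID = b.ID  -- illustrative join key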

EDIT to elaborate on "best practices": It's a best practice because you don't want your transformation-stage models to have to know where the data originated. They just need to know the data is available to them. dbt provides its own recommendations here. Personally, I do things a little differently (a sketch of a matching project config follows the list):

  1. Base models: Know only about sources. One model, one source. Provide simple data type conversions, establish field naming conventions, interpret enum values, etc. No transformations or joins with other sources/models.
  2. Clean models: Remove garbage data from your base models. Possibly simple data enrichment.
  3. Staging (or intermediate) models: Transformations.
  4. Mart: Final destination. Facts and dimension tables.
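
As promised above, here's a sketch of how that layering might be wired up in dbt_project.yml. The project name my_project and the per-layer materializations are assumptions for illustration, not rules:

# dbt_project.yml (hypothetical excerpt)
models:
  my_project:
    base:
      +materialized: view
    clean:
      +materialized: view
    staging:
      +materialized: ephemeral
    mart:
      +materialized: table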

Upvotes: 3
