Arthur
Arthur

Reputation: 41

How to specify model schema when referencing another dbt project as a package? (dbt multi-repo setup)

We're using a dbt multi-repo setup with different projects for different business areas. We have several projects, something like this:

The dbt_dwh project contains models which we plan to reference in projects 1 and 2 (we have ~10 projects that would reference the dbt_dwh project) by way of installing git packages. Ideally, we'd like to be able to just reference the models in the dbt_dwh project (e.g. SELECT * FROM {{ ref('dbt_dwh', 'model_1') }}). However, each of our projects sits in it's own database schema and this causes issue upon dbt run because dbt uses the target schema from dbt_project_x, where these objects don't exist. I've included example set-up info below, for clarity.

packages.yml file for dbt_project1:

packages:
  - git: https://git/repo/url/here/dbt_dwh.git
    revision: master

profiles.yml for dbt_dwh:

dbt_dwh:
  target: dwh_dev
  outputs:
    dwh_dev:
      <config rows here>
    dwh_prod:
      <config rows here>

profiles.yml for dbt_project1:

dbt_project1:
  target: project1_dev
  outputs:
    project1_dev:
      <config rows here>
    project1_prod:
      <config rows here>

sf_orders.sql in dbt_dwh:

{{
  config(
    materialized = 'table',
    alias = 'sf_orders'
    )
}}

SELECT * FROM {{ source('salesforce', 'orders') }} WHERE uid IS NOT NULL

revenue_model1.sql in dbt_project1:

{{
  config(
    materialized = 'table',
    alias = 'revenue_model1'
    )
}}

SELECT * FROM {{ ref('dbt_dwh', 'sf_orders') }}

My expectation here was that dbt would examine the sf_orders model and see that the default schema for the project it sits in (dbt_dwh) is dwh_dev, so it would construct the object reference as dwh_dev.sf_orders.

However, if you use command dbt run -m revenue_model1 then the default dbt behaviour is to assume all models are located in the default target for dbt_project1, so you get something like:

11:05:03  1 of 1 START sql table model project1_dev.revenue_model1 .................... [RUN]
11:05:04  1 of 1 ERROR creating sql table model project1_dev.revenue_model1 ........... [ERROR in 0.89s]
11:05:05
11:05:05  Completed with 1 error and 0 warnings:
11:05:05  
11:05:05  Runtime Error in model revenue_model1 (folder\directory\revenue_model1.sql)
11:05:05    404 Not found: Table database_name.project1_dev.sf_orders was not found

I've got several questions here:

Some points to note:

Upvotes: 2

Views: 7756

Answers (3)

Omar2535
Omar2535

Reputation: 98

I had a similar use case as you. In our case, we could to explicitly define the +schema for the upstream packages instead of the downstream one. Something like this:

dbt_dwh: dbt_project.yml

models:
  dbt_dwh:
    marts:
      +schema: *[define these here]*
      ...

then in our downstream packages, we could simply use ref like: ref('dbt_dwh', 'mart_my_table')

Of course, the answer from the original author also works.

Upvotes: 2

I faced this same issue today.

After few hours reading and trying stuff I made it work by specifying the database (GCP project) in the config block in the first model (the one living in dbt_dwh project I guess).

It should look like this:

{{
  config(
    materialized = 'table',
    alias = 'sf_orders',
    database = 'dbt_dwh',
    schema = 'bq_dataset_name' # Not sure if this one is needed but worth trying
    )
}}

That worked for me. Maybe you can also try specifying the database in one of the schema.yml files or even the dbt_project.yml file or profiles.yml file.

Upvotes: 0

Arthur
Arthur

Reputation: 41

I managed to find a solution so I'll answer my own question in case anybody else runs up against the same issue. Unfortunately this is not documented anywhere that I can find, however, a throw-away comment in the dbt Slack workspace sparked an idea that allowed me to find the/a solution (I'll post the message if I manage to find it again, to give credit where it's due).

To fix this is actually very simple, you just need to add the project being imported to your dbt_project.yml file and specify the schema. For our use case this is fine as we only have 1 schema we use.

dbt_project.yml for dbt_project1:

models:
  db_project_1:
  outputs:
    project1_dev:
      <configs here>
    project1_prod:
      <configs here>
  dbt_dwh:
    +schema: [[schema you want these models to run into]]
    <configs here>

The advantages with this approach are:

  • When you generate/serve dbt docs it allows you to see the upstream lineage from the upstream project
  • If there are any upstream dependencies in your upstream project you can run this using dbt run -m +model_name (this can be super handy)
    • If you don't want this behaviour then you can use dbt run -m +model_name --exclude dbt_dwh (for example) to prevent models in your upstream project from running.

I haven't yet figured out if it is possible to use the default parameters/settings for models inside the upstream project (in this case dbt_dwh) but I will edit this answer if I find a way.

Upvotes: 1

Related Questions