Reputation: 11
I'm working with a large dataset (50 tables) that gets updated regularly, and I'm trying to use incremental models in dbt. However, when I run dbt run
in "incremental mode," the process doesn't finish—even after running for an entire day.
Details:
Here’s a snippet of my incremental model:
{{
config(
unique_id = ['civil_id']
)
}}
select *
from {{ source('row', 'pop_person') }}
{% if is_incremental() %}
where
civil_id not in (
select
civil_id
from {{ this }}
)
{% endif %}
I've configured my models to be incremental in the dbt_project.yml:
models:
stg:
+materialized: incremental
When I ran a --full-refresh
, it completed in 15 minutes, which still feels long given the data size.
I'm using the source function in these incremental models. Does anyone have suggestions on what might be causing the long run times or how I can optimize this?
Here's a rough diagram of my setup: Structure Thanks in advance for your help!
I tried running in full refresh, and disabled elementary package. I want the dbt run in incremental mode to take 3 minutes max
Upvotes: 1
Views: 187
Reputation: 1
First of all better to use table partitioning, if there is no date key then ensure that civil_id and any other columns used in the WHERE clause or joins are indexed. This will significantly speed up the DELETE and SELECT operations.
Instead of using NOT IN, consider using a LEFT JOIN or EXCEPT which might be more performant, especially on large datasets. Here's an alternative:
{% if is_incremental() %}
where not exists (
select 1
from {{ this }}
where {{ this }}.civil_id = source.civil_id
)
{% endif %}
Upvotes: 0