PolFrx
PolFrx

Reputation: 75

Is there a way to force a full refresh for a dbt incremental model if its model changed?

I'll implement dbt for pipelines in Snowflake with incremental models to save query costs but I want to manage the changes of schemas that will be quite frequent. I will have one daily ETL job for each env running a dbt run. Also, in qa and prod environments I'll not be able to run any cmd as I don't have access to these environments for security issues, only to dev.

Is it possible to trigger a full refresh of a model if its schema changed?

I saw that we can use the on_schema_change option with incremental models but this will just add (or drop) columns without populating them which is not exactly what I'm looking for as I'll not be able to run a force refresh manually in qa and prod.

Thanks a lot

Upvotes: 5

Views: 23580

Answers (3)

Pablo Carrera Lorenzo
Pablo Carrera Lorenzo

Reputation: 33

How I've solved this issue (performing a full-refresh automatically if the schema of the table changes) is by modifying the tool you use to orchestrate your models and trigger the dbt run. That's where I automatically add the --full-refresh flag if necessary.

For example, if you use Airflow, you can perform a retry with the --full-refresh flag if the first run failed with the "out of sync" error that happens if the source table had schema changes.

It's not the most beautiful solution, but it worked very nicely for me.

Upvotes: 1

Paddy Alton
Paddy Alton

Reputation: 2348

Is it possible to [automatically] trigger a full refresh of a model if its schema changed?

Sadly, the answer is 'no'.

The docs for on_schema_change are unequivocal:

Note: None of the on_schema_change behaviors backfill values in old records for newly added columns. If you need to populate those values, we recommend running manual updates, or triggering a --full-refresh.

I read back through some of the old GitHub issues and the behaviour you desire was discussed, but ultimately not implemented.

So what are your options?

Frequent schema changes sounds bad! Is this because of upstream changes, or because you're iterating the DBT models?

If the former, this becomes partly a communication problem. If your upstream colleagues can't help you even after you explain that this causes difficulties, then you're best off not doing an incremental model; you can explain that the increased costs are the price to support frequent schema changes.

If the latter, hopefully the frequent changes will eventually cease? You could go non-incremental until then (avoid prematurely optimising).

Otherwise: you're going to have to find a way to do occasional full refreshes. Could you deploy a full-refresh version whenever you deploy a schema change, followed by a non-full-refresh version the next day?

In this picture let's imagine

  • you have semantically versioned models, for example you're on v1.2.1 -you want to release v1.3.0, which includes a new column on your incremental table
  • you will deploy a v1.3.0-prerelease with full_refresh=true added to the model config for the updated mode
  • then the next day (after the daily run) deploy v1.3.0 with this removed

If you really have to do frequent schema changes and optimise your costs, this acknowledges that a full refresh ought to somehow be captured in version control, even if in a light touch way.

Upvotes: 4

Lukasz Szozda
Lukasz Szozda

Reputation: 175706

Yes, it is possible. How do I rebuild an incremental model?:

If your incremental model logic has changed, the transformations on your new rows of data may diverge from the historical transformations, which are stored in your target table. In this case, you should rebuild your incremental model.

To force dbt to rebuild the entire incremental model from scratch, use the --full-refresh flag on the command line. This flag will cause dbt to drop the existing target table in the database before rebuilding it for all-time.

$ dbt run --full-refresh --select my_incremental_model+

It's also advisable to rebuild any downstream models, as indicated by t he trailing +.


https://docs.getdbt.com/reference/resource-configs/full_refresh

{{ config(
    full_refresh = true
) }}

select ...

Optionally set a resource to always or never full-refresh.

Upvotes: 4

Related Questions