some_data_guy

Reputation: 41

Our nightly job in Redshift taking 10-12 hours to run

First post here. Apologies if I leave anything out. I work at a relatively small, <70-person company. We have a nightly job scheduled in DBT that runs in Redshift; it is our attempt at a Last-Touch Attribution model. The job takes anywhere from 10 to 12 hours to run, and I have to assume something is wrong with either our warehouse size or our modeling in DBT.

We have one cluster (ra3.xlplus) in AWS with 3 nodes. We are pulling roughly 100MM rows, and I'm basically trying to see if anyone has ideas. Our DBT models are materialized as incremental tables, and I have added additional queues to our Redshift instance. That's as much as I've done so far.
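
For reference, our incremental models follow roughly this shape (the model name, columns, and keys below are placeholders for illustration, not our real ones):

```sql
-- models/last_touch_attribution.sql (hypothetical name/columns)
{{
  config(
    materialized='incremental',
    unique_key='session_id',     -- assumed unique key
    sort='event_timestamp',      -- Redshift sort key
    dist='user_id'               -- Redshift distribution key
  )
}}

select
    session_id,
    user_id,
    event_timestamp,
    channel as last_touch_channel
from {{ ref('stg_events') }}     -- assumed upstream staging model
{% if is_incremental() %}
  -- only process rows newer than what is already in the target table
  where event_timestamp > (select max(event_timestamp) from {{ this }})
{% endif %}
```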

Happy to provide any additional detail if needed.

Cheers all!

Upvotes: 4

Views: 1115

Answers (1)

Bill Weiner

Reputation: 11032

Welcome to the forum. There are a number of reasons things can take this long, but I've seen many issues like this and there are some likely culprits. In general, for queries taking this long, the issue will most likely be with the query, not the cluster size or the data model - though these can contribute to a poor query.

Most likely (90%): there is some step in the query that is expanding the data during the query. It is likely a JOIN or something that compiles to a JOIN. A step like this can create massive intermediate result sets that take up a lot of space. One sure indicator that this is happening is if available disk space drops on the cluster (though this could look different on the ra3 node type). However, you can have this issue without filling disks, so another way is to look at the data sizes (rows) in the EXPLAIN plan, especially if there is a loop join step in the plan.
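
If it helps, a quick way to check for this kind of thing (assuming you can pick out the query id of the slow statement) is to find the query in STL_QUERY and then look at the alerts Redshift raised for it - nested loop warnings are a strong hint of row explosion. A rough sketch:

```sql
-- Find the slowest recent queries (adjust the time window as needed)
select query, starttime, endtime, substring(querytxt, 1, 80) as query_text
from stl_query
where starttime > dateadd(hour, -24, getdate())
order by datediff(seconds, starttime, endtime) desc
limit 20;

-- Then look at the alerts raised for that query id
select event, solution, count(*) as occurrences
from stl_alert_event_log
where query = 123456   -- replace with your query id
group by event, solution
order by occurrences desc;
```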

Some chance (50%): The query / data model is moving way too much data around the cluster network. Poor distribution keys, or queries that don't work with the data distribution, can end up pushing many times your data set size across the network between nodes.
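
A quick sanity check on distribution (schema name below is a placeholder) is to look at the distribution style and skew of the big tables involved, and to look for DS_BCAST_INNER or DS_DIST_BOTH steps in the EXPLAIN plan - both mean large amounts of data being broadcast or redistributed:

```sql
-- Distribution style, skew, and size of the large tables in the model
select "table", diststyle, size as size_1mb_blocks, tbl_rows, skew_rows, unsorted
from svv_table_info
where schema = 'analytics'   -- replace with your schema
order by size desc;
```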

Some chance (30%): The query is massive and scans data tables in full multiple times.
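
You can get a rough read on this from the per-step summary for the query id (pulled from STL_QUERY as above) - repeated scans of the same large table show up as multiple scan entries with big row counts:

```sql
-- Scan steps per table for one query
select label, count(*) as scan_steps, sum(rows) as rows_scanned
from svl_query_summary
where query = 123456          -- replace with your query id
  and label like 'scan%'
group by label
order by rows_scanned desc;
```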

Slight chance (20%): Too many queues defined. The more queue slots you define, the less memory is allocated per query. Insufficient memory for the query being run can cause it to perform poorly. This will show up most when the cluster is busy running queries in the other slots.
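
Since you mentioned adding queues, it's worth checking how much working memory each slot actually gets. Something along these lines (user-defined manual WLM queues use service classes 6-13):

```sql
-- Slots per queue and working memory per slot (MB)
select service_class, name, num_query_tasks, query_working_mem
from stv_wlm_service_class_config
where service_class >= 6
order by service_class;
```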

Unlikely (10%): You are using many expensive functions in your query. Some built-in functions are complex and slow (but not this slow). If you have UDFs running, these can be absolute performance killers.

I know that these don't add up to 100%, and this is because you very likely have more than one occurring in your case. For the data size you are stating, 10 hours of run time is quite excessive. I really expect you will find that your query is expanding the data massively and forcing the query to swap (spill) to disk, while also distributing this massive data around the cluster over the network.
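
You can confirm the spill directly from the step summary - any step with is_diskbased = 't' ran out of working memory and went to disk:

```sql
-- Steps that spilled to disk for one query
select step, label, rows, workmem, is_diskbased
from svl_query_summary
where query = 123456          -- replace with your query id
  and is_diskbased = 't'
order by rows desc;
```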

I've worked with many clients on issues like this so if you need help reading explain plans or understanding what is going on let me know and I'll try to help more.

Upvotes: 3
