Robert

Reputation: 541

Optimising a sequence of SQL calculations - Nested or Separate Queries?

Short Intro:

When a process requires a dozen chained calculation queries, is it more optimal to nest them all into one procedure, or to keep each step as a separate query that persists its results?

Full Description:

I am trying to calculate some advanced forecasts from a series of input tables in SQL.

I am building around a dozen 'modules' that are separated into their own schemas, and each module typically includes 4-10 input tables and 6-10 calculation steps. All outputs from each module are dumped into the same output table once completed.

Queries process anywhere from 7k to 200k rows.

A single schema/module, then, holds its input tables plus one table per calculation step.

Each calculation query uses the results of the previous one (for the most part). The final output is the result of the final calculation query. The calculations are not very complex: a partitioned MAX, a basic formula (+, -, *, /), or a SUM, et cetera. There are normally only 1-3 of these per calculation step, and always on the same column.

The main reason this is split into multiple calculation queries (instead of one super-formula) is that each calculation joins the outputs in a different way and uses different input tables, and also that some calculations depend on previous-row results (such as a partitioned MAX or LAG).
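For illustration, a minimal sketch of one such calculation step (all schema, table, and column names here are invented), computing a partitioned MAX and a previous-row LAG over the same Value column:

    -- Hypothetical calculation step: a partitioned MAX plus a
    -- previous-row LAG, both over the same Value column.
    SELECT
        f.ForecastID,
        f.[Group],
        f.[Year],
        f.Value,
        MAX(f.Value) OVER (PARTITION BY f.[Group], f.[Year]) AS GroupYearMax,
        LAG(f.Value) OVER (PARTITION BY f.[Group] ORDER BY f.[Year]) AS PrevYearValue
    FROM Module1.Calc3 AS f;  -- output of the previous calculation step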

My requirements are as follows:

I DON'T need to store the calculation results of intermediate queries - only the final output, or the 'overriding final' if selected.

My Problem: I am trying to optimise the entire process. At this point it looks like it will take around 10-15 seconds; I want it to take 1 second, though I appreciate this is probably not possible.

What I have tried:

Firstly, I created a procedure for each calculation query that MERGEs its results into the respective output table. With this method, each calculation query must read from the database and then merge into its output table.
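A stripped-down sketch of what one of those per-step procedures might look like (schema, table, and column names are invented, and CREATE OR ALTER assumes SQL Server 2016 SP1 or later):

    CREATE OR ALTER PROCEDURE Module1.usp_CalcStep2
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Step 2 reads the output of step 1, applies a basic formula,
        -- and merges the result into this step's output table.
        MERGE Module1.Calc2 AS target
        USING (
            SELECT s.ForecastID, s.[Group], s.[Year], s.Value * r.Rate AS Value
            FROM Module1.Calc1 AS s
            JOIN Module1.Rates AS r
                ON r.[Group] = s.[Group]
               AND r.[Year]  = s.[Year]
        ) AS source
            ON target.ForecastID = source.ForecastID
        WHEN MATCHED THEN
            UPDATE SET target.Value = source.Value
        WHEN NOT MATCHED THEN
            INSERT (ForecastID, [Group], [Year], Value)
            VALUES (source.ForecastID, source.[Group], source.[Year], source.Value);
    END;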

I tried temp tables, however I don't see why this would be optimal, because I already have permanent tables for the calculation steps - which are indexed with the next step in mind.

I then assumed it would be faster to simply nest all the queries into one super-procedure, or perhaps to chain a sequence of table-valued functions.
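If the table-function route is taken, an inline table-valued function keeps each step composable while letting the optimizer expand it into the calling query's plan; a sketch, again with invented names:

    -- One calculation step as an inline table-valued function.
    -- Inline TVFs are expanded into the calling query's plan rather
    -- than executed separately.
    CREATE OR ALTER FUNCTION Module1.fn_CalcStep2 ()
    RETURNS TABLE
    AS
    RETURN
        SELECT s.ForecastID, s.[Group], s.[Year], s.Value * r.Rate AS Value
        FROM Module1.Calc1 AS s
        JOIN Module1.Rates AS r
            ON r.[Group] = s.[Group]
           AND r.[Year]  = s.[Year];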

My Question:

However, I ran into a question that I could not find an answer for, which is the following:

Are SELECT results intelligently indexed? And given my scenario, what advice would you give on how to approach this? Maybe I am missing something really simple.

Additional Info:

For Example: ForecastID, Group, Year, Type, Sub-Type, Value

So I have to index Group, Year, Type and Sub-Type to join multiple input tables, and then calculate on the Value column.
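To make that shape concrete, a hypothetical version of one such table and its composite join index (every name and datatype below is an assumption, not my actual schema):

    -- Hypothetical calculation table; the composite index covers the
    -- join columns and carries Value so the next step avoids lookups.
    CREATE TABLE Module1.Calc1 (
        ForecastID int            NOT NULL,
        [Group]    int            NOT NULL,
        [Year]     smallint       NOT NULL,
        [Type]     int            NOT NULL,
        SubType    int            NOT NULL,
        Value      decimal(18, 4) NOT NULL
    );

    CREATE INDEX IX_Calc1_JoinCols
        ON Module1.Calc1 ([Group], [Year], [Type], SubType)
        INCLUDE (Value);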

I am telling you this in case having index-heavy tables influences your advice - I won't ask for help on optimizing indexes here, due to the overwhelming quantity of advice already available and because it's a different question!

Upvotes: 1

Views: 994

Answers (2)

g2server

Reputation: 5367

It sounds like your datasets from the subqueries are more than a few thousand rows, so I would start off with approach A (persisting the intermediate steps): persist some of these intermediate result sets to #temptables, check the execution plan for scans on these tables, and index the #temptables if needed.
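A sketch of that pattern, with invented names; SELECT ... INTO creates the #temptable, and the index is only worth adding if the plan shows scans:

    -- Persist one intermediate step to a temp table.
    SELECT ForecastID, [Group], [Year], [Type], SubType, Value
    INTO #Step3
    FROM Module1.Calc3;

    -- Add this only if the execution plan shows scans on #Step3.
    CREATE CLUSTERED INDEX IX_Step3
        ON #Step3 ([Group], [Year], [Type], SubType);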

If you want to use approach B (nesting everything into one query), or to mix A and B, I suggest CTEs instead of nested queries where possible. They are more readable, and it is easier to switch to #temptables when you are testing/designing the query.
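For example, two steps written as chained CTEs rather than nested subqueries (names invented); each CTE can later be swapped for a #temptable with minimal rewriting:

    WITH Step1 AS (
        SELECT ForecastID, [Group], [Year], Value
        FROM Module1.Inputs
    ),
    Step2 AS (
        -- Reads Step1 just as it would read a persisted table.
        SELECT s.ForecastID, s.[Group], s.[Year], s.Value * r.Rate AS Value
        FROM Step1 AS s
        JOIN Module1.Rates AS r
            ON r.[Group] = s.[Group]
           AND r.[Year]  = s.[Year]
    )
    SELECT *
    FROM Step2;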

Upvotes: 0

Randall

Reputation: 1521

Query optimization is often more art than science; there are few hard-and-fast rules because there are so many possible influences on the outcome. With that big caveat out of the way, time to hit the high points.

Indexes' effect on loading tables - Indexes have a performance impact on inserts similar to triggers. Unless you have a filtered index, each insert has to update every index on the table, so with three indexes you are looking at quadrupling the writes per insert (one for the base table plus one per index). At one read per insert and a small table size of 200k rows (very doable for a table scan), three indexes probably put you outside the butter zone for the cost vs. benefit of having those indexes on your work tables.
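For reference, a filtered index is only maintained for rows matching its predicate, which is what trims that per-insert overhead; a sketch with invented names and an arbitrary predicate:

    -- Maintained only for rows WHERE [Year] = 2024, so inserts for
    -- other years skip this index entirely.
    CREATE INDEX IX_Output_CurrentYear
        ON Module1.[Output] ([Group], [Type])
        WHERE [Year] = 2024;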

Nesting results - Like CTEs, nested results work best when the entire result set can fit in memory. When part is in memory and part is on disk, it will generally perform worse than a similarly sized temp table without an index. At 5 or so columns for 200k rows with smallish datatypes and a modern server, you should be OK performance-wise with nesting queries, so long as you're only doing one result set at a time. Once again this varies based on your setup; if you are strapped for RAM, drop them into a temp table.

Joins - Another possible good reason to use temp tables/nested queries is to avoid excessively large joins. Logically, the first step in a join is a full Cartesian product between the tables, which is then filtered based on the ON and WHERE clauses. The join process is heavily optimized in all RDBMSs, so most of the time you are not aware of how much heavy lifting is occurring behind the scenes; however, when tables reach large sizes this can be a major performance pain point. So instead you select the subset of data you require from both tables, and join the two much smaller sets. Once again, the butter zone between subsets and full-table joins depends on a number of factors, so you'll have to play around with your queries to find where it is for your situation.
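A sketch of that subset-then-join pattern (tables and predicates are purely illustrative):

    -- Pre-filter each side into a temp table, then join the much
    -- smaller sets instead of the full tables.
    SELECT ForecastID, [Group], [Year], Value
    INTO #LeftSubset
    FROM Module1.Calc4
    WHERE [Year] >= 2020;

    SELECT [Group], [Year], Rate
    INTO #RightSubset
    FROM Module1.Rates
    WHERE [Year] >= 2020;

    SELECT l.ForecastID, l.Value * r.Rate AS Value
    FROM #LeftSubset AS l
    JOIN #RightSubset AS r
        ON r.[Group] = l.[Group]
       AND r.[Year]  = l.[Year];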

Unfortunately I can't really give specific advice without some sample inputs and outputs and/or an execution plan, but I hope this is some food for thought. Good luck.

Upvotes: 2
