Robert

Reputation: 541

Optimising a sequence of SQL calculations - Nested or Separate Queries?

Short Intro:

When a process requires a dozen chained calculation queries, is it more optimal to nest them all into one procedure, or to keep each step as a separate query that persists its results?

Full Description:

I am trying to calculate some advanced forecasts from a series of input tables in SQL.

I am building around a dozen 'modules' that are separated into their own schemas, and each module typically includes 4-10 input tables and 6-10 calculation steps. All outputs from each module are dumped into the same output table once completed.

Queries process anywhere from 7k to 200k rows.

A single schema/module, then, holds its input tables plus one table per calculation step.

Each calculation query uses the results of the previous one (for the most part). The final output is the result of the final calculation query. The calculations are not very complex: a partitioned MAX, a basic formula (+, -, *, /), or a SUM, et cetera. There are normally only 1-3 of these per calculation step, and always on the same column.

The main reason this is split into multiple calculation queries (instead of one super-formula) is that each calculation joins the outputs in a different way and uses different input tables, and also that some calculations depend on previous-row results (such as a partitioned MAX or LAG).
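For illustration, a minimal sketch of one such calculation step (all schema, table, and column names here are invented), computing a partitioned MAX and a previous-row LAG over the same Value column:

    -- Hypothetical calculation step: a partitioned MAX plus a
    -- previous-row LAG, both over the same Value column.
    SELECT
        f.ForecastID,
        f.[Group],
        f.[Year],
        f.Value,
        MAX(f.Value) OVER (PARTITION BY f.[Group], f.[Year]) AS GroupYearMax,
        LAG(f.Value) OVER (PARTITION BY f.[Group] ORDER BY f.[Year]) AS PrevYearValue
    FROM Module1.Calc3 AS f;  -- output of the previous calculation step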

My requirements are as follows:

I DON'T need to store the calculation results of intermediate queries - only the final output, or the 'overriding final' if selected.

My Problem: I am trying to optimise the entire process. At this point it looks like it will take around 10-15 seconds; I want it to take 1 second, though I appreciate this is probably not possible.

What I have tried:

Firstly, I created a procedure for each calculation query that MERGEs its results into the respective output table. With this method, each calculation query must read from the database and then merge into its output table.
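A stripped-down sketch of what one of those per-step procedures might look like (schema, table, and column names are invented, and CREATE OR ALTER assumes SQL Server 2016 SP1 or later):

    CREATE OR ALTER PROCEDURE Module1.usp_CalcStep2
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Step 2 reads the output of step 1, applies a basic formula,
        -- and merges the result into this step's output table.
        MERGE Module1.Calc2 AS target
        USING (
            SELECT s.ForecastID, s.[Group], s.[Year], s.Value * r.Rate AS Value
            FROM Module1.Calc1 AS s
            JOIN Module1.Rates AS r
                ON r.[Group] = s.[Group]
               AND r.[Year]  = s.[Year]
        ) AS source
            ON target.ForecastID = source.ForecastID
        WHEN MATCHED THEN
            UPDATE SET target.Value = source.Value
        WHEN NOT MATCHED THEN
            INSERT (ForecastID, [Group], [Year], Value)
            VALUES (source.ForecastID, source.[Group], source.[Year], source.Value);
    END;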

I tried temp tables, however I don't see why this would be optimal, because I already have permanent tables for the calculation steps - which are indexed with the next step in mind.

I then assumed it would be faster to simply nest all the queries into one super-procedure, or perhaps to chain a sequence of table-valued functions.
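If the table-function route is taken, an inline table-valued function keeps each step composable while letting the optimizer expand it into the calling query's plan; a sketch, again with invented names:

    -- One calculation step as an inline table-valued function.
    -- Inline TVFs are expanded into the calling query's plan rather
    -- than executed separately.
    CREATE OR ALTER FUNCTION Module1.fn_CalcStep2 ()
    RETURNS TABLE
    AS
    RETURN
        SELECT s.ForecastID, s.[Group], s.[Year], s.Value * r.Rate AS Value
        FROM Module1.Calc1 AS s
        JOIN Module1.Rates AS r
            ON r.[Group] = s.[Group]
           AND r.[Year]  = s.[Year];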

My Question:

However, I ran into a question that I could not find an answer for, which is the following:

Are SELECT results intelligently indexed? And given my scenario, what advice would you give on how to approach this? Maybe I am missing something really simple.

Additional Info:

For Example: ForecastID, Group, Year, Type, Sub-Type, Value

So I have to index Group, Year, Type and Sub-Type to join multiple input tables, and then calculate on the Value column.
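To make that shape concrete, a hypothetical version of one such table and its composite join index (every name and datatype below is an assumption, not my actual schema):

    -- Hypothetical calculation table; the composite index covers the
    -- join columns and carries Value so the next step avoids lookups.
    CREATE TABLE Module1.Calc1 (
        ForecastID int            NOT NULL,
        [Group]    int            NOT NULL,
        [Year]     smallint       NOT NULL,
        [Type]     int            NOT NULL,
        SubType    int            NOT NULL,
        Value      decimal(18, 4) NOT NULL
    );

    CREATE INDEX IX_Calc1_JoinCols
        ON Module1.Calc1 ([Group], [Year], [Type], SubType)
        INCLUDE (Value);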

I am telling you this in case having index-heavy tables influences your advice - I won't ask for help on optimizing indexes here, due to the overwhelming quantity of advice already available and because it's a different question!

Upvotes: 1

Views: 994

Answers (2)

g2server

Reputation: 5367

It sounds like your datasets from the subqueries are more than a few thousand rows, so I would start off with approach A (persisting the intermediate steps): persist some of these intermediate result sets to #temptables, check the execution plan for scans on these tables, and index the #temptables if needed.
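A sketch of that pattern, with invented names; SELECT ... INTO creates the #temptable, and the index is only worth adding if the plan shows scans:

    -- Persist one intermediate step to a temp table.
    SELECT ForecastID, [Group], [Year], [Type], SubType, Value
    INTO #Step3
    FROM Module1.Calc3;

    -- Add this only if the execution plan shows scans on #Step3.
    CREATE CLUSTERED INDEX IX_Step3
        ON #Step3 ([Group], [Year], [Type], SubType);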

If you want to use approach B (nesting everything into one query), or to mix A and B, I suggest CTEs instead of nested queries where possible. They are more readable, and it is easier to switch to #temptables when you are testing/designing the query.
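For example, two steps written as chained CTEs rather than nested subqueries (names invented); each CTE can later be swapped for a #temptable with minimal rewriting:

    WITH Step1 AS (
        SELECT ForecastID, [Group], [Year], Value
        FROM Module1.Inputs
    ),
    Step2 AS (
        -- Reads Step1 just as it would read a persisted table.
        SELECT s.ForecastID, s.[Group], s.[Year], s.Value * r.Rate AS Value
        FROM Step1 AS s
        JOIN Module1.Rates AS r
            ON r.[Group] = s.[Group]
           AND r.[Year]  = s.[Year]
    )
    SELECT *
    FROM Step2;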

Upvotes: 0

Randall

Reputation: 1521

Query optimization is often more art than science; there are few hard-and-fast rules because there are so many possible influences on the outcome. With that big caveat out of the way, time to hit the high points.

Indexes' effect on loading tables - Indexes have a performance impact on inserts similar to triggers. Unless you have a filtered index, each insert has to update every index on the table, so with three indexes you are looking at quadrupling the writes per insert (one for the base table plus one per index). At one read per insert and a small table size of 200k rows (very doable for a table scan), three indexes probably put you outside the butter zone for the cost vs. benefit of having those indexes on your work tables.
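For reference, a filtered index is only maintained for rows matching its predicate, which is what trims that per-insert overhead; a sketch with invented names and an arbitrary predicate:

    -- Maintained only for rows WHERE [Year] = 2024, so inserts for
    -- other years skip this index entirely.
    CREATE INDEX IX_Output_CurrentYear
        ON Module1.[Output] ([Group], [Type])
        WHERE [Year] = 2024;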

Nesting results - Like CTEs, nested results work best when the entire result set can fit in memory. When part is in memory and part is on disk, it will generally perform worse than a similarly sized temp table without an index. At 5 or so columns for 200k rows with smallish datatypes and a modern server, you should be OK performance-wise with nesting queries, so long as you're only doing one result set at a time. Once again this varies based on your setup; if you are strapped for RAM, drop them into a temp table.

Joins - Another possible good reason to use temp tables/nested queries is to avoid excessively large joins. Logically, the first step in a join is a full Cartesian product between the tables, which is then filtered based on the ON and WHERE clauses. The join process is heavily optimized in all RDBMSs, so most of the time you are not aware of how much heavy lifting is occurring behind the scenes; however, when tables reach large sizes this can be a major performance pain point. So instead you select the subset of data you require from both tables, and join the two much smaller sets. Once again, the butter zone between subsets and full-table joins depends on a number of factors, so you'll have to play around with your queries to find where it is for your situation.
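A sketch of that subset-then-join pattern (tables and predicates are purely illustrative):

    -- Pre-filter each side into a temp table, then join the much
    -- smaller sets instead of the full tables.
    SELECT ForecastID, [Group], [Year], Value
    INTO #LeftSubset
    FROM Module1.Calc4
    WHERE [Year] >= 2020;

    SELECT [Group], [Year], Rate
    INTO #RightSubset
    FROM Module1.Rates
    WHERE [Year] >= 2020;

    SELECT l.ForecastID, l.Value * r.Rate AS Value
    FROM #LeftSubset AS l
    JOIN #RightSubset AS r
        ON r.[Group] = l.[Group]
       AND r.[Year]  = l.[Year];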

Unfortunately I can't really give specific advice without some sample inputs and outputs and/or an execution plan, but I hope this is some food for thought. Good luck.

Upvotes: 2
