Reputation: 688
Is it good/acceptable practice to create an additional step in an INSERT proc to select relevant data from a source view into a temp table, then use that as the source for the subsequent INSERT statement?
BACKGROUND: I am trying to improve performance on an ETL process that has degraded massively since an exponential increase in data volume.
There are c.1500 tables, populated by stored procedures (via SSIS packages) which are created from a handful of templates and customisable metadata.
Most procs follow a process such as:
INSERT INTO dest_tbl
SELECT col1, col2, col3...
FROM source_view s
LEFT JOIN other_tbl ot on s.col1 = ot.col1
WHERE s.date > ot.prev_date
AND ot.col2 is null
For the worst affected procedures, the source_views are necessarily quite complex, joining 5-10 tables and applying business rules. A typical run inserts in the region of 80,000 records into a table with 89,000,000 rows.
I have found some poor estimates in the query plans, which I had tried to resolve by:
- Updating statistics (with FULLSCAN) on the tables/indexes used
- Amending logic in the underlying view
- Creating new indexes
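For reference, the statistics refresh was along these lines (the object names below are placeholders for the tables and indexes the view actually touches):
-- Refresh every statistic on a source table with a full scan (placeholder names):
UPDATE STATISTICS dbo.source_tbl WITH FULLSCAN;
-- Or target one specific index/statistic:
UPDATE STATISTICS dbo.source_tbl IX_source_tbl_col1 WITH FULLSCAN;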
I don't have a strong background in those things, especially the more complex end of indexing - but had some limited success.
However, when I tried adding a step to insert everything from the view into a temp table before the JOIN/filter/INSERT, it dramatically improved performance - some tables running as much as 85% faster - e.g.:
SELECT *
INTO #TMP
FROM source_view
INSERT INTO dest_tbl
SELECT col1, col2, col3...
FROM #TMP s
LEFT JOIN other_tbl ot on s.col1 = ot.col1
WHERE s.date > ot.prev_date
AND ot.col2 is null
Now, in the short-medium term I am happy with this; my critical performance issue is resolved.
My questions are really:
1) Is this considered good/acceptable practice?
2) Is this likely to be scalable as data volume continues to increase?
3) Are there any gotchas with this approach I may not have considered?
Or am I likely just delaying the pain of more in-depth query/ index tuning?
EDIT 06/09/2019
Further info following Alan Burnstein's answer:
1) I am confident it's not just the refreshed execution plan causing the improvement in this case. Some procs where I have used this method have been live for around 6 weeks and have consistently performed much better than the best runs of the previous proc. In test, I cleared the cache before running each method (roughly as sketched after this list) and the old version was always slower, even with updated stats and a fresh query plan.
2) Tempdb is split across 6 data files and has not been experiencing autogrowth recently. Our server is on a virtual infrastructure, and while I won't pretend to understand much about how it works, our IT team and storage supplier assure me that all DBs/LUNs will move to faster storage based on activity. While I suspect that could be improved, I don't think it is a major factor in this example - and my hands are fairly tied in terms of changing anything on that set-up.
3) I have tried a few changes to how the view is written without much success, but accept there is probably room for improvement there. The view and underlying tables are pretty much only used during the daily ETL which happens overnight, so there shouldn't be other processes locking and blocking.
4) Both versions are going parallel and there are no user-defined functions or computed columns called by the view.
5) This is possible, but the only parameter passed into this section of the proc is the load date.
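For completeness, the cache clearing mentioned in point 1 was roughly as follows (run on a test server only; a sketch rather than the exact script):
-- Start each timed test run from a cold cache.
CHECKPOINT;              -- flush dirty pages so DROPCLEANBUFFERS can clear everything
DBCC DROPCLEANBUFFERS;   -- empty the buffer pool
DBCC FREEPROCCACHE;      -- throw away all cached execution plans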
My background is more using SQL for analysis, so while I'm now learning how to read query plans, understand statistics, devise indexes and tune queries, it's been a steep learning curve and I'm still very much a novice.
While I have experienced the exploding transaction log issue in other cases (despite SIMPLE recovery mode), that isn't happening for most of these. Batching could be worth a go though.
It seems like the poor row count estimate when selecting from the view has less of an impact when SELECT-ing INTO a temp table than when INSERT-ing into a large persisted table, so running that step first and then letting the INSERT work with accurate row counts performs well... but I may be looking at that from the wrong angle.
Upvotes: 1
Views: 471
Reputation: 7918
My questions are really: 1) Is this considered good/acceptable practice?
Yes - acceptable, but you haven't provided enough detail to determine if it's a "good" practice. For example, you said:
"However, when I tried adding a step to insert everything from the view into a temp table before another JOIN / filter/ INSERT, it dramatically improved performance - some tables running as much as 85% faster"
The important thing to understand is why the query improved. All things being equal, pulling data from a table into a temp table and then reading from that temp table should in theory be slower because you're doing more work, BUT there are a number of reasons why doing it the way you are has improved performance. Here are a few (just off the top of my head):
Creating a new temp table will force a fresh execution plan with fresh stats and better cardinality estimations, which often leads to a better plan. Is it still 85% faster when/if you run the original code with OPTION (RECOMPILE)?
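As a sketch, reusing the table and view names from the question, that test would look like:
INSERT INTO dest_tbl
SELECT col1, col2, col3...
FROM source_view s
LEFT JOIN other_tbl ot ON s.col1 = ot.col1
WHERE s.date > ot.prev_date
AND ot.col2 IS NULL
OPTION (RECOMPILE);  -- compile a fresh plan for this run instead of reusing a cached one
If the temp table version is still dramatically faster even against a freshly compiled plan, the gain isn't just plan/statistics freshness.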
Temp tables are created in tempdb... If tempdb is configured well (e.g. on fast disk, split up correctly, no autogrowth happening) and the view references objects on slow/fragmented/heavily pounded-on disks, that would affect things too.
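A quick way to sanity-check the tempdb layout (file count, sizes and growth settings) is something like:
SELECT name, physical_name,
       size * 8 / 1024 AS size_mb,   -- size is stored in 8 KB pages
       growth, is_percent_growth
FROM sys.master_files
WHERE database_id = DB_ID('tempdb');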
You won't have locking, blocking, lock escalation, deadlocks, etc. when reading from your temp table (or a table variable), whereas you can with a view, especially if it's poorly written or gets pounded on often.
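If you want to confirm whether blocking actually happens while the load runs, a simple check from a second session is something like:
-- Sessions that are currently blocked and what is blocking them.
SELECT session_id, blocking_session_id, wait_type, wait_time, command
FROM sys.dm_exec_requests
WHERE blocking_session_id <> 0;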
Parallelism can greatly improve a query; you'll often see queries that run for minutes with a serial execution plan speed up to seconds or milliseconds with a parallel execution plan. There could be something in the view or underlying data structure that forces a serial plan, such as a scalar user-defined function that is called as part of the view, or as a constraint and/or computed column on one of the underlying tables. Dumping into a temp table first might circumvent this.
Using the temp table could eliminate parameter sniffing issues that you are having when referencing the views....
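If the proc is parameterised (the question mentions a load-date parameter), one way to sidestep sniffing on it without the temp table is a hint along these lines - @LoadDate and its predicate are assumptions here, not taken from the actual proc:
INSERT INTO dest_tbl
SELECT col1, col2, col3...
FROM source_view s
LEFT JOIN other_tbl ot ON s.col1 = ot.col1
WHERE s.date >= @LoadDate            -- hypothetical use of the load-date parameter
AND s.date > ot.prev_date
AND ot.col2 IS NULL
OPTION (OPTIMIZE FOR UNKNOWN);       -- build the plan from average density, not the sniffed value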
I could go on but this is all hypothetical. For a more accurate understanding of why your queries are slower or faster, improve or regress - learn to read the execution plan and how to use STATISTICS TIME and STATISTICS IO. You can go deeper by running traces, leveraging perfmon, using Extended Events and/or the Query Store, as well as various third-party tools like Idera, Redgate and SentryOne. START WITH UNDERSTANDING EXECUTION PLANS - THE ACTUAL EXECUTION PLAN (in addition to the estimated plans).
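For example, wrapping the statement you're tuning like this prints per-table reads and per-statement CPU/elapsed time to the Messages tab:
SET STATISTICS IO ON;    -- logical/physical reads per table
SET STATISTICS TIME ON;  -- CPU and elapsed time per statement
-- ... run the INSERT you are comparing here ...
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;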
2) Is this likely to be scalable as data volume continues to increase?
Depends on your answers to the first question. Does adding more rows increase, decrease or leave those performance gains unchanged? If doubling the data means you now only see a 20% performance gain, try testing with triple the data and see if the trend of diminishing returns continues. Conversely, if after doubling the data you now see a 200% performance increase, add more and see if the trend continues.
3) Are there any gotchas with this approach I may not have considered?
Yeah - tons. I don't have time to go over all of them but I'll give you one of my favorite "gotchas" - autogrowth and the exploding transaction log. Big inserts will generate a lot of transaction log activity, especially when the Recovery Model is set to FULL. To handle this I have found it best to batch my inserts, often using a stored procedure that allows me to specify how many rows to do per batch.
Here's a quick example I spun up on how you can batch your data modifications, particularly inserts...
SET NOCOUNT ON;

-- Sample data
------------------------------------------------------------------------------------------
DECLARE @source TABLE(id INT IDENTITY PRIMARY KEY, Col1 CHAR(1));
DECLARE @target TABLE(id INT IDENTITY PRIMARY KEY, Col1 CHAR(1));

INSERT @source (Col1)
SELECT TOP (1000) LEFT(NEWID(),1) -- random letter/number
FROM sys.all_columns;

-- User-defined input parameter:
------------------------------------------------------------------------------------------
DECLARE @batchsize INT = 100;

-- Routine
------------------------------------------------------------------------------------------
DECLARE @i INT = @batchsize; -- high-water mark for the current batch
DECLARE @stage TABLE(id INT IDENTITY PRIMARY KEY, Col1 CHAR(1));

INSERT @stage(Col1) SELECT Col1 FROM @source;

WHILE EXISTS (SELECT 1 FROM @stage)
BEGIN
  -- Move the next @batchsize rows (ids up to the current watermark) into the target...
  INSERT @target (Col1)
  SELECT s.Col1
  FROM @stage AS s
  WHERE s.id <= @i;

  -- ...remove them from the stage table...
  DELETE FROM @stage
  WHERE id <= @i;

  -- ...and advance the watermark by one batch.
  SET @i += @batchsize;
END;

--SELECT * FROM @target;
Upvotes: 1
Reputation: 4058
You can use a table variable instead of a temp table; in a table variable you can define indexes (via PRIMARY KEY/UNIQUE constraints), so you can avoid problems coming from the underlying source view.
Try something like this:
DECLARE @SRC TABLE (
IDX INT IDENTITY PRIMARY KEY,
col1 INT,
col2 INT,
col3 INT,
...
...
[date] DATETIME,
UNIQUE (COL1, [DATE], IDX) -- ADD AN INDEX ON COL1 + DATE
)
INSERT INTO @SRC (col1, col2, col3, ..., [date])  -- a table variable can't be the target of SELECT ... INTO
SELECT col1, col2, col3, ..., [date]
FROM source_view
INSERT INTO dest_tbl
SELECT col1, col2, col3...
FROM @SRC s
LEFT JOIN other_tbl ot on s.col1 = ot.col1
WHERE s.date > ot.prev_date
AND ot.col2 is null
You can also try changing the join condition like this; whether it helps depends on the indexes on other_tbl:
INSERT INTO dest_tbl
SELECT col1, col2, col3...
FROM @SRC s
LEFT JOIN other_tbl ot on s.col1 = ot.col1 AND s.date > ot.prev_date
WHERE ot.col2 is null
Depending on your server version and your hardware, the optimizer can use different plans, so you should also try other indexes, like this:
DECLARE @SRC TABLE (
IDX INT IDENTITY PRIMARY KEY,
col1 INT,
col2 INT,
col3 INT,
...
...
[date] DATETIME,
UNIQUE (COL1, [DATE], IDX), -- ADD AN INDEX ON COL1 + DATE
UNIQUE ([DATE], IDX) -- ADD AN INDEX ON DATE
)
Final hint: you can also extract the other_table rows into a table variable, filtering on the COL2 nulls up front:
DECLARE @OTHER TABLE (
IDX INT IDENTITY PRIMARY KEY,
col1 INT,
col2 INT,
col3 INT,
...
...
prev_date DATETIME,
UNIQUE (COL1, prev_date, IDX), -- ADD AN INDEX ON COL1 + prev_date
UNIQUE (prev_date, IDX) -- ADD AN INDEX ON prev_date
)
INSERT INTO @OTHER (col1, col2, col3, ..., prev_date)
SELECT col1, col2, col3, ..., prev_date
FROM other_table
WHERE COL2 IS NULL
INSERT INTO dest_tbl
SELECT col1, col2, col3...
FROM @SRC s
LEFT JOIN @OTHER ot on s.col1 = ot.col1 AND s.date > ot.prev_date
Upvotes: 1