Optimizing a XQuery-heavy SQL query used in SSIS data flow task

Question

This query is used to import around 600,000 rows into a target table in a data warehouse every night. The target table is truncated before each import.

SELECT -- Around 70 fields from MainTable which contains around 600,000 rows
-- Around 150 fields from around 50 various tables, some quite big
-- Around 35 fields from XQuery derived table queries such as dt_EXTERNAL_CODE1
FROM MainTable
LEFT JOIN -- Around 50 tables
LEFT JOIN
(
SELECT df.ParentID,
ISNULL(df2.XMLValue.value('(Item/*[local-name()="CustomData"]/root/A/Number)[1]', 'float'),0) AS a, 
ISNULL(df2.XMLValue.value('(Item/*[local-name()="CustomData"]/root/B/Number)[1]', 'float'),0) AS b,
ISNULL(df2.XMLValue.value('(Item/*[local-name()="CustomData"]/root/C/Number)[1]', 'float'),0) AS c,
ISNULL(df2.XMLValue.value('(Item/*[local-name()="CustomData"]/root/D/Number)[1]', 'float'),0) AS d,
ISNULL(df2.XMLValue.value('(Item/*[local-name()="CustomData"]/root/E/Number)[1]', 'float'),0) AS e,
ISNULL(df2.XMLValue.value('(Item/*[local-name()="CustomData"]/root/F/Number)[1]', 'float'),0) AS f
FROM DynamicField df
INNER JOIN DynamicField df1 ON df.DynamicFieldID = df1.DynamicFieldID
INNER JOIN DynamicField df2 ON df1.DynamicFieldID = df2.ParentID
WHERE df2.XMLValue.value('(Item/*[local-name()="ExternalCode"])[1]', 'nvarchar(50)') IN('EXTERNAL_CODE1')
) dt_EXTERNAL_CODE1 ON MainTable.DynamicFieldID = dt_EXTERNAL_CODE1.ParentID
LEFT JOIN -- 6 more like the derived table query above, but with some other external code

The SSIS import job takes about 10 hours to complete. Any suggestions on how to optimize this query? The joins cannot be inner joins.

trenton-ftw · Accepted Answer

I just want to mention, this advice is without considering possible indexes that could be added to your XML in the database. Depending on the other six queries that you didn't show and many other factors, indexing the XML data might be a good thing to do as well. The advice I gave here is really just general X-query advice that would apply to almost any X-query expressions.

Also worth noting that storing and using XML to filter like this in a SQL database along with relational data is a BAD idea, especially when you plan on doing large scale ETL solutions using that data. As you've already experienced, it is going to be a hassle. If you are still at the stage where this could be changed, I would STRONGLY advise that you do so.

That aside, here a couple of suggestions:

First, the filter expression WHERE df2.XMLValue.value('(Item/*[local-name()="ExternalCode"])[1]', 'nvarchar(50)') IN('EXTERNAL_CODE1') should be rewritten to use the exist operator (Microsoft doc here). Per Microsoft (here):

For performance reasons, instead of using the value() method in a predicate to compare with a relational value, use exist() with sql:column()

Second, I would move the newly created exist expression to the join clause instead of the WHERE clause. When I view this query, the optimizer might be applying your filter to the entire DynamicField df2 table prior to actually performing the join. Depending on the cardinality of these joins, that could be really detrimental to performance. I'd imagine you want this filter to only be performed for rows returned from your expression FROM DynamicField df INNER JOIN DynamicField df1 ON df.DynamicFieldID = df1.DynamicFieldID. The main point here is Reducing the number of records that are going to use any XML filtering is going to seriously help performance.

Third, each call to value() is going to instantiate a new XML reader that is required to traverse the path (Item/*[local-name()="CustomData"]/root/D/Number). Reducing the amount of work that each instance of a XML reader has to do to retrieve the value required for your SELECT will go a long way for performance. If you have a repetitive path you are traversing (like your example) you would likely be better off using an additional OUTER APPLY operator to a query call to retrieve the XML element root as a separate node, then using that new node in your value statements in the final SELECT. Something like this:

SELECT 
   df.ParentID
    ,ISNULL(root.RootXmlFrag.value('(root/A/Number)[1]', 'float'),0) AS a
    ...... 
FROM 
    DynamicField df
    INNER JOIN DynamicField df1 ON df.DynamicFieldID = df1.DynamicFieldID
    INNER JOIN DynamicField df2 ON df1.DynamicFieldID = df2.ParentID
    OUTER APPLY df2.XMLValue.query('(Item/*[local-name()="CustomData"]/root)[1]') AS root(RootXmlFrag)

The actual final query path expression might vary, but working with the idea that you don't want to have to traverse a complicated path like (Item/*[local-name()="CustomData"]/root/D/Number) for each value expression will certainly help performance in the end.

Optimizing a XQuery-heavy SQL query used in SSIS data flow task

Answers (1)

Related Questions