Reputation: 34848
What is the best way to shred XML data into various database columns? So far I have mainly been using the nodes and value functions like so:
INSERT INTO some_table (column1, column2, column3)
SELECT
Rows.n.value('(@column1)[1]', 'varchar(20)'),
Rows.n.value('(@column2)[1]', 'nvarchar(100)'),
Rows.n.value('(@column3)[1]', 'int')
FROM @xml.nodes('//Rows') Rows(n)
However, I find that this gets very slow for even moderately sized XML data.
Upvotes: 28
Views: 74744
Reputation: 841
Stumbled across this question whilst having a very similar problem. I'd been running a query processing a 7.5MB XML file (approx. 10,000 nodes) for around 3.5-4 hours before finally giving up.
However, after a little more research I found that, having typed the XML using a schema and created an XML index (I'd bulk-inserted the XML into a table), the same query completed in ~0.04ms.
How's that for a performance improvement!
Code to create the schema collection:
IF EXISTS ( SELECT * FROM sys.xml_schema_collections where [name] = 'MyXmlSchema')
DROP XML SCHEMA COLLECTION [MyXmlSchema]
GO
DECLARE @MySchema XML
SET @MySchema =
(
SELECT * FROM OPENROWSET
(
BULK 'C:\Path\To\Schema\MySchema.xsd', SINGLE_CLOB
) AS xmlData
)
CREATE XML SCHEMA COLLECTION [MyXmlSchema] AS @MySchema
GO
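A quick way to confirm the collection registered is to query the same catalog view used in the existence check above:

SELECT [name], create_date
FROM sys.xml_schema_collections
WHERE [name] = 'MyXmlSchema'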
Code to create the table with a typed XML column:
CREATE TABLE [dbo].[XmlFiles] (
[Id] [uniqueidentifier] NOT NULL,
-- Data from CV element
[Data] xml(CONTENT dbo.[MyXmlSchema]) NOT NULL,
CONSTRAINT [PK_XmlFiles] PRIMARY KEY NONCLUSTERED
(
[Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
Code to create the index:
CREATE PRIMARY XML INDEX PXML_Data
ON [dbo].[XmlFiles] (Data)
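If path-heavy queries are still slow after that, secondary XML indexes can be layered on top of the primary one. A minimal sketch (the index name is illustrative; PATH tends to help path-based predicates, and there are VALUE and PROPERTY variants for other query shapes):

CREATE XML INDEX IXML_Data_Path
ON [dbo].[XmlFiles] (Data)
USING XML INDEX PXML_Data
FOR PATH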
There are a few things to bear in mind, though. SQL Server's XML schema implementation doesn't support xsd:include. This means that if you have a schema which references other schemas, you'll have to merge them all into a single schema and register that.
Also, I would get the following error:
XQuery [dbo.XmlFiles.Data.value()]: Cannot implicitly atomize or apply 'fn:data()' to complex content elements, found type 'xs:anyType' within inferred type 'element({http://www.mynamespace.fake/schemas}:SequenceNumber,xs:anyType) ?'.
if I tried to navigate above the node I had selected with the nodes function. E.g.
SELECT
 C.value('CVElementId[1]', 'INT') AS [CVElementId]
,C.value('../SequenceNumber[1]', 'INT') AS [Level]
FROM
[dbo].[XmlFiles]
CROSS APPLY
[Data].nodes('/CVSet/Level/CVElement') AS T(C)
I found that the best way to handle this was to use OUTER APPLY to, in effect, perform an "outer join" on the XML; the OUTER APPLY keeps each Level row even when it has no CVElement children.
SELECT
 C.value('CVElementId[1]', 'INT') AS [CVElementId]
,B.value('SequenceNumber[1]', 'INT') AS [Level]
FROM
[dbo].[XmlFiles]
CROSS APPLY
[Data].nodes('/CVSet/Level') AS T(B)
OUTER APPLY
B.nodes ('CVElement') AS S(C)
Hope that helps someone, as that's pretty much been my day.
Upvotes: 57
Reputation: 171
In my case I'm running SQL Server 2005 SP2 (9.0).
The only thing that helped was adding OPTION (OPTIMIZE FOR (@your_xml_var = NULL)). The explanation is at the link below.
Example:
INSERT INTO @tbl (Tbl_ID, Name, Value, ParamData)
SELECT 1,
tbl.cols.value('name[1]', 'nvarchar(255)'),
tbl.cols.value('value[1]', 'nvarchar(255)'),
tbl.cols.query('./paramdata[1]')
FROM @xml.nodes('//root') as tbl(cols) OPTION ( OPTIMIZE FOR ( @xml = NULL ) )
Upvotes: 5
Reputation: 2258
We had a similar issue here. Our DBA (SP, you the man) took a look at my code, made a little tweak to the syntax, and we got the speed we had been expecting. It was unusual because my select from XML was plenty fast, but the insert was way slow. So try this syntax instead:
INSERT INTO some_table (column1, column2, column3)
SELECT
Rows.n.value(N'(@column1/text())[1]', 'varchar(20)'),
Rows.n.value(N'(@column2/text())[1]', 'nvarchar(100)'),
Rows.n.value(N'(@column3/text())[1]', 'int')
FROM @xml.nodes('//Rows') Rows(n)
So specifying the text() step really seems to make a difference in performance. It took our insert of 2K rows from 'I must have written that wrong - let me stop it' to about 3 seconds, which was 2x faster than the raw insert statements we had been running through the connection.
Upvotes: 3
Reputation: 14006
I wouldn't claim this is the "best" solution, but I've written a generic SQL CLR procedure for this exact purpose - it takes a "tabular" XML structure (such as that returned by FOR XML RAW) and outputs a resultset.
It does not require any customization or knowledge of the structure of the "table" in the XML, and it turns out to be extremely fast and efficient (although this wasn't a design goal). I just shredded a 25MB (untyped) XML variable in under 20 seconds, returning 25,000 rows of a pretty wide table.
Hope this helps someone: http://architectshack.com/ClrXmlShredder.ashx
Upvotes: 2
Reputation: 34848
My current solution for large XML sets (> 500 nodes) is to use SQL bulk copy (System.Data.SqlClient.SqlBulkCopy): I load the XML into a DataSet in memory and then pass the table to SqlBulkCopy (defining an XML schema helps).
Obviously there are pitfalls, such as needlessly using a DataSet and loading the whole document into memory first. I would like to go further in the future and implement my own IDataReader to bypass the DataSet approach, but currently the DataSet is "good enough" for the job.
Basically, I never found a solution to my original question regarding the slow performance of that type of XML shredding. It could be slow due to the typed XML queries being inherently slow, or something to do with transactions and the SQL Server log. I guess the typed XML functions were never designed for operating on non-trivial node counts.
XML Bulk Load: I tried this and it was fast, but I had trouble getting the COM DLL to work in 64-bit environments, and I generally try to avoid COM DLLs that no longer appear to be supported.
sp_xml_preparedocument/OPENXML: I never went down this road, so I would be interested to see how it performs.
Upvotes: 0
Reputation: 16848
There is an XML Bulk Load COM object (.NET example).
From MSDN:
You can insert XML data into a SQL Server database by using an INSERT statement and the OPENXML function; however, the Bulk Load utility provides better performance when you need to insert large amounts of XML data.
Upvotes: 0
Reputation: 1227
This isn't an answer, more an addition to this question: I have just come across the same problem, and I can give the figures edg asks for in the comments.
My test XML results in 244 records being inserted - so 244 nodes.
The code that I am rewriting takes on average 0.4 seconds to run (10 test runs, spread from 0.344 to 0.56 seconds). Performance is not the main reason the code is being rewritten, but the new code needs to perform as well or better. The old code loops over the XML nodes, calling a stored procedure to insert once per loop.
The new code is pretty much just a single stored procedure: pass the XML in, shred it.
Tests with the new code switched in show that the new stored procedure takes on average 3.7 seconds - almost 10 times slower.
My query is in the form posted in this question:
INSERT INTO some_table (column1, column2, column3)
SELECT
Rows.n.value('(@column1)[1]', 'varchar(20)'),
Rows.n.value('(@column2)[1]', 'nvarchar(100)'),
Rows.n.value('(@column3)[1]', 'int')
FROM @xml.nodes('//Rows') Rows(n)
The execution plan appears to show that, for each column, SQL Server does a separate "Table Valued Function [XMLReader]" returning all 244 rows, then joins everything back up with Nested Loops (Inner Join). So in my case, where I am shredding from and inserting into about 30 columns, this appears to happen 30 separate times.
I am going to have to dump this code; I don't think any optimisation is going to get over this method being inherently slow. I am going to try the sp_xml_preparedocument/OPENXML method (sketched below) and see if the performance is better. If anyone comes across this question from a web search (as I did), I would highly advise you to do some performance testing before using this type of shredding in SQL Server.
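For reference, a minimal sp_xml_preparedocument/OPENXML sketch of the same insert might look like this (flag 1 maps attributes; the column names are taken from the question):

DECLARE @doc int
EXEC sp_xml_preparedocument @doc OUTPUT, @xml

INSERT INTO some_table (column1, column2, column3)
SELECT column1, column2, column3
FROM OPENXML (@doc, '//Rows', 1)
WITH (column1 varchar(20),
      column2 nvarchar(100),
      column3 int)

-- Free the handle afterwards; the parsed document otherwise stays in memory
EXEC sp_xml_removedocument @doc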
Upvotes: 1
Reputation: 123974
I'm not sure what the best method is. I used the OPENXML construction:
DECLARE @DocHandle int
EXEC sp_xml_preparedocument @DocHandle OUTPUT, @XmlDocument -- OPENXML needs a handle, not the XML text

INSERT INTO Test
SELECT Id, Data
FROM OPENXML (@DocHandle, '/Root/blah', 2)
WITH (Id   int         '@ID',
      Data varchar(10) '@DATA')

EXEC sp_xml_removedocument @DocHandle
To speed it up, you can create XML indexes. A secondary index can be set up specifically to optimize value() performance, as shown below. You can also use typed XML columns, which perform better.
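A sketch, assuming an xml column [Data] on a table dbo.Test that already has a clustered primary key (the index names are illustrative):

CREATE PRIMARY XML INDEX PXML_Test_Data
ON dbo.Test (Data)

CREATE XML INDEX IXML_Test_Data_Value
ON dbo.Test (Data)
USING XML INDEX PXML_Test_Data
FOR VALUE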
Upvotes: 3