Reputation: 83
I have a single XML value. It contains quite a lot of XML, about 13 MB. The file is available publicly at https://www.profinfo.pl/export/bookseller.xml In fact it contains an invalid character, but I remove all such characters before passing it to SQL Server.
The file contains the following XML structure:
<?xml version="1.0" encoding="UTF-8"?>
<products>
<product>
<productId><![CDATA[108316]]></productId>
<productCode><![CDATA[PIP-6500:201903]]></productCode>
<productDateAdd><![CDATA[2019-03-15 14:54:52]]></productDateAdd>
<productDateUpdate><![CDATA[2019-03-15 15:07:20]]></productDateUpdate>
<productPrice><![CDATA[67.00]]></productPrice>
<productNettoPrice><![CDATA[62.04]]></productNettoPrice>
<productName><![CDATA[Państwo i Prawo - Nr 3/2019 [877]]]></productName>
<productState><![CDATA[]]></productState>
<productRateTax><![CDATA[8]]></productRateTax>
<productBarcode><![CDATA[]]></productBarcode>
<productUrl><![CDATA[https://www.profinfo.pl/sklep/panstwo-i-prawo-nr-32019-877,7255,r,2019,nr,3.html]]></productUrl>
<productImages>
<productImage><![CDATA[https://static.profinfo.pl/image/core_products/2019/3/15/a6e51ae4589402ae9e895c97eb42671e/admin/shop/0031-0980_877.jpg]]></productImage>
<productImage><![CDATA[https://static.profinfo.pl/image/core_products/2019/3/15/2b0c1bdae2ca13e745d2860428317e74/admin/shop/0031-0980_877.jpg]]></productImage>
<productImage><![CDATA[https://static.profinfo.pl/image/core_products/2019/3/15/ae609e0b6df358538f385fe4c668ab61/admin/shop/0031-0980_877.jpg]]></productImage>
</productImages>
<productCategories>
<productCategory ID="144"><![CDATA[Historia, teoria i filozofia prawa]]></productCategory>
</productCategories>
<productAuthors>
<productAuthor><![CDATA[Andrzej Wróbel]]></productAuthor>
</productAuthors>
<productAttributes>
<productAttribute name="Format"><![CDATA[]]></productAttribute>
<productAttribute name="Wydanie"><![CDATA[]]></productAttribute>
<productAttribute name="Medium"><![CDATA[Czasopismo papier]]></productAttribute>
<productAttribute name="Rodzaj"><![CDATA[]]></productAttribute>
<productAttribute name="Rok publikacji"><![CDATA[2019]]></productAttribute>
<productAttribute name="Spis treści"><![CDATA[<p><strong>TREŚĆ ZESZYTU 3/2019 „PAŃSTWA I PRAWA”</strong></p>]]></productAttribute>
<productAttribute name="Wydawnictwo"><![CDATA[Wolters Kluwer Polska]]></productAttribute>
<productAttribute name="Okładka"><![CDATA[]]></productAttribute>
<productAttribute name="Liczba stron"><![CDATA[]]></productAttribute>
<productAttribute name="Opis katalogowy 500"><![CDATA[]]></productAttribute>
<productAttribute name="Opis"><![CDATA[]]></productAttribute>
<productAttribute name="Stan prawny"><![CDATA[]]></productAttribute>
<productAttribute name="ISBN"><![CDATA[]]></productAttribute>
<productAttribute name="Wersja publikacji"><![CDATA[Czasopismo papier]]></productAttribute>
<productAttribute name="KodTowaru"><![CDATA[PIP-6500:201903]]></productAttribute>
<productAttribute name="Dostępność"><![CDATA[]]></productAttribute>
<productAttribute name="DrukNaŻyczenie"><![CDATA[0]]></productAttribute>
</productAttributes>
</product>
…
</products>
Now, a want to execute following query in SQL:
SELECT
product.value('productId[1]', 'VARCHAR(50)') AS id,
product.value('(productAttributes/productAttribute[@name="Format"])[1]', 'NVARCHAR(255)') AS format
FROM @xml.nodes('/products/product') p(product)
The problem is, that the query is extremely slow.
I found the way to overcome this limitation. I can rewrite this query extracting productAttributes
node as a separate XML and query only this part:
SELECT
product.value('productId[1]', 'VARCHAR(50)') AS id,
attrib.value('productAttribute[@name="Format"][1]', 'NVARCHAR(255)') AS format
FROM @xml.nodes('/products/product') p(product)
OUTER APPLY (SELECT product.query('productAttributes/productAttribute') AS attrib) a
but still I don't understand why the first query has such low performance.
For testing I limited the size of the returned data to the TOP 100 rows and I tried to compare the plans of both queries.
The most suspicious fragment is the XML Reader with the XPath filter, which returns 3139900 rows.
This is equal to 1847 (number of product
elements in whole file) times 17 (number of productAttribute
elements in each product) times 100 (number of top returned rows).
Without limiting this function will return 1847 * 1847 * 17 = 57993953, which I think explains the low performance. It grows quadratically with the number of products in te XML file.
Interestingly, the complexity of the first, much slower query is estimated at 31% of the batch, second at 69%.
Does anyone know why filtering without an additional .query
method is so expensive?
Upvotes: 1
Views: 1915
Reputation: 22283
Please try the following approach.
It takes less than one second to process your 13 MB XML file on my machine. I also needed to remove a some kind of special character to make the XML file well-formed.
Points to mention:
OUTER APPLY
simulates LEFT OUTER JOIN
. We need this because XML has two main hierarchies (one-to-many) product-to-productAttributes.text()
at the end of each XPath
expressions for XML elements for the XQuery .value()
method. It is SQL Server XQuery peculiarity. And I don't see that you are doing that.SQL
DECLARE @tbl TABLE(
ID INT IDENTITY(1, 1) PRIMARY KEY,
XmlColumn XML
);
INSERT INTO @tbl(XmlColumn)
SELECT * FROM OPENROWSET(BULK N'c:\Temp\bookseller.xml', SINGLE_BLOB) AS x;
SELECT c.value('(productId/text())[1]', 'VARCHAR(10)') AS productId
, x.value('text()[1]', 'VARCHAR(255)') AS [Format]
FROM @tbl CROSS APPLY XmlColumn.nodes('/products/product') AS t1(c)
OUTER APPLY t1.c.nodes('productAttributes/productAttribute[@name="Format"]') AS t2(x);
Partial Output
+-----------+--------+
| productId | Format |
+-----------+--------+
| 108008 | NULL |
| 108017 | NULL |
| 108316 | NULL |
| 108325 | NULL |
| 108574 | NULL |
| 108589 | A5 |
| 108595 | B5 |
+-----------+--------+
Upvotes: 2