keiichi
keiichi

Reputation: 83

XML query very slow, when .value method on XML type contains attribute filter

I have a single XML value. It contains quite a lot of XML, about 13 MB. The file is available publicly at https://www.profinfo.pl/export/bookseller.xml In fact it contains an invalid character, but I remove all such characters before passing it to SQL Server.

The file contains the following XML structure:

<?xml version="1.0" encoding="UTF-8"?>
<products>
    <product>
        <productId><![CDATA[108316]]></productId>
        <productCode><![CDATA[PIP-6500:201903]]></productCode>
        <productDateAdd><![CDATA[2019-03-15 14:54:52]]></productDateAdd>
        <productDateUpdate><![CDATA[2019-03-15 15:07:20]]></productDateUpdate>
        <productPrice><![CDATA[67.00]]></productPrice>
        <productNettoPrice><![CDATA[62.04]]></productNettoPrice>
        <productName><![CDATA[Państwo i Prawo - Nr 3/2019 [877]]]></productName>
        <productState><![CDATA[]]></productState>
        <productRateTax><![CDATA[8]]></productRateTax>
        <productBarcode><![CDATA[]]></productBarcode>
        <productUrl><![CDATA[https://www.profinfo.pl/sklep/panstwo-i-prawo-nr-32019-877,7255,r,2019,nr,3.html]]></productUrl>
        <productImages>
            <productImage><![CDATA[https://static.profinfo.pl/image/core_products/2019/3/15/a6e51ae4589402ae9e895c97eb42671e/admin/shop/0031-0980_877.jpg]]></productImage>
            <productImage><![CDATA[https://static.profinfo.pl/image/core_products/2019/3/15/2b0c1bdae2ca13e745d2860428317e74/admin/shop/0031-0980_877.jpg]]></productImage>
            <productImage><![CDATA[https://static.profinfo.pl/image/core_products/2019/3/15/ae609e0b6df358538f385fe4c668ab61/admin/shop/0031-0980_877.jpg]]></productImage>
        </productImages>
        <productCategories>
            <productCategory ID="144"><![CDATA[Historia, teoria i filozofia prawa]]></productCategory>
        </productCategories>
        <productAuthors>
            <productAuthor><![CDATA[Andrzej Wr&oacute;bel]]></productAuthor>
        </productAuthors>
        <productAttributes>
            <productAttribute name="Format"><![CDATA[]]></productAttribute>
            <productAttribute name="Wydanie"><![CDATA[]]></productAttribute>
            <productAttribute name="Medium"><![CDATA[Czasopismo papier]]></productAttribute>
            <productAttribute name="Rodzaj"><![CDATA[]]></productAttribute>
            <productAttribute name="Rok publikacji"><![CDATA[2019]]></productAttribute>
            <productAttribute name="Spis treści"><![CDATA[&lt;p&gt;&lt;strong&gt;TREŚĆ ZESZYTU 3/2019 &bdquo;PAŃSTWA I PRAWA&rdquo;&lt;/strong&gt;&lt;/p&gt;]]></productAttribute>
            <productAttribute name="Wydawnictwo"><![CDATA[Wolters Kluwer Polska]]></productAttribute>
            <productAttribute name="Okładka"><![CDATA[]]></productAttribute>
            <productAttribute name="Liczba stron"><![CDATA[]]></productAttribute>
            <productAttribute name="Opis katalogowy 500"><![CDATA[]]></productAttribute>
            <productAttribute name="Opis"><![CDATA[]]></productAttribute>
            <productAttribute name="Stan prawny"><![CDATA[]]></productAttribute>
            <productAttribute name="ISBN"><![CDATA[]]></productAttribute>
            <productAttribute name="Wersja publikacji"><![CDATA[Czasopismo papier]]></productAttribute>
            <productAttribute name="KodTowaru"><![CDATA[PIP-6500:201903]]></productAttribute>
            <productAttribute name="Dostępność"><![CDATA[]]></productAttribute>
            <productAttribute name="DrukNaŻyczenie"><![CDATA[0]]></productAttribute>
        </productAttributes>
    </product>
    …
</products>

Now, a want to execute following query in SQL:

SELECT
    product.value('productId[1]', 'VARCHAR(50)') AS id,
    product.value('(productAttributes/productAttribute[@name="Format"])[1]', 'NVARCHAR(255)') AS format
FROM @xml.nodes('/products/product') p(product)

The problem is, that the query is extremely slow. I found the way to overcome this limitation. I can rewrite this query extracting productAttributes node as a separate XML and query only this part:

SELECT
    product.value('productId[1]', 'VARCHAR(50)') AS id,
    attrib.value('productAttribute[@name="Format"][1]', 'NVARCHAR(255)') AS format
FROM @xml.nodes('/products/product') p(product)
    OUTER APPLY (SELECT product.query('productAttributes/productAttribute') AS attrib) a

but still I don't understand why the first query has such low performance.

For testing I limited the size of the returned data to the TOP 100 rows and I tried to compare the plans of both queries.

The most suspicious fragment is the XML Reader with the XPath filter, which returns 3139900 rows.

Query plan

This is equal to 1847 (number of product elements in whole file) times 17 (number of productAttribute elements in each product) times 100 (number of top returned rows). Without limiting this function will return 1847 * 1847 * 17 = 57993953, which I think explains the low performance. It grows quadratically with the number of products in te XML file.

Interestingly, the complexity of the first, much slower query is estimated at 31% of the batch, second at 69%.

Does anyone know why filtering without an additional .query method is so expensive?

Upvotes: 1

Views: 1915

Answers (1)

Yitzhak Khabinsky
Yitzhak Khabinsky

Reputation: 22283

Please try the following approach.

It takes less than one second to process your 13 MB XML file on my machine. I also needed to remove a some kind of special character to make the XML file well-formed.

Points to mention:

  • We are loading entire XML file into a SQL Server table with one single row. All this instead of loading XML file into a variable of XML type in a memory.
  • OUTER APPLY simulates LEFT OUTER JOIN. We need this because XML has two main hierarchies (one-to-many) product-to-productAttributes.
  • Behind the scenes, while dealing with XML, SQL Server creates temporary tables. That's why query execution plan shows Table Value Function with XML reader for each.
  • It is a best practice to use text() at the end of each XPath expressions for XML elements for the XQuery .value() method. It is SQL Server XQuery peculiarity. And I don't see that you are doing that.

SQL

DECLARE @tbl TABLE(
    ID INT IDENTITY(1, 1) PRIMARY KEY,
    XmlColumn XML
);

INSERT INTO @tbl(XmlColumn)
SELECT * FROM OPENROWSET(BULK N'c:\Temp\bookseller.xml', SINGLE_BLOB) AS x;

SELECT c.value('(productId/text())[1]', 'VARCHAR(10)') AS productId
    , x.value('text()[1]', 'VARCHAR(255)') AS [Format]
FROM @tbl CROSS APPLY XmlColumn.nodes('/products/product') AS t1(c)
    OUTER APPLY t1.c.nodes('productAttributes/productAttribute[@name="Format"]') AS t2(x);

Partial Output

+-----------+--------+
| productId | Format |
+-----------+--------+
|    108008 | NULL   |
|    108017 | NULL   |
|    108316 | NULL   |
|    108325 | NULL   |
|    108574 | NULL   |
|    108589 | A5     |
|    108595 | B5     |
+-----------+--------+

Upvotes: 2

Related Questions