SQLFox
SQLFox

Reputation: 289

Get a distinct list of XML nodes from large file

I'm trying to get a distinct list of node names from an XML file. I've been successful with a recursive CTE similar to https://stackoverflow.com/a/2274091/1735928, but only with my files below about 1m characters. Above that, the query never seems to return. Some of my files are in the neighborhood of 100m characters.

I've since switched to trying PowerShell. For this samle XML:

<?xml version="1.0" encoding="UTF_8"?>
<root>
  <childA>
    <descendant1>
      <descendant1_1>
        <descendant1_1_1>1111111111</descendant1_1_1>
      </descendant1_1>
    </descendant1>
    <descendant2>0</descendant2>
  </childA>
  <childA>
    <descendant1>
      <descendant1_1>
        <descendant1_1_1>2222222222</descendant1_1_1>
      </descendant1_1>
    </descendant1>
    <descendant2>2</descendant2>
  </childA>
  <childB>
    <descendant1>
      <descendant1_1>
        <descendant1_1_1>2222222222</descendant1_1_1>
      </descendant1_1>
    </descendant1>
    <descendant3>0</descendant3>
  </childB>
  <childC>
    <descendant4>0</descendant4>
  </childC>
  <childC>
    <descendant4>6</descendant4>
  </childC>
</root>

I've gotten as far as:

$xml.childnodes[1].childnodes | select -uniq | foreach { $xml.childnodes[1].($_.name).childnodes.name | select -uniq }

which gives me:

descendant1
descendant2
descendant1
descendant3
descendant4

but that doesn't include the further descendants. Ultimately, I'm trying to get a table back to SQL that looks like this:

root | childA | descendant1
root | childA | descendant1_1
root | childA | descendant1_1_1
root | childA | descendant2
root | childB | descendant1
root | childB | descendant1_1
root | childB | descendant1_1_1
root | childB | descendant3
root | childC | descendant4

Upvotes: 1

Views: 922

Answers (2)

Mike Twc
Mike Twc

Reputation: 2355

Below is a naive solution, assuming you know the depth of your xml. But probably you can do entire thing with xqury and thus do it on SQL side

[xml]$x = "your xml here"

# ------ LEVEL 2 children

$L2 = $x | Select-Xml "//root/*/*"

foreach($n in $L2) { 
 $L1 = $n.node.ParentNode.LocalName
 $CHILD = $n.node.localname
 [PSCustomObject]@{L1=$L1; CHILD = $CHILD}
}

# ------ LEVEL 3 children 


$L3 = $x | Select-Xml "//root/*/*/*"

foreach($n in $L3) { 
 $L1 = $n.node.ParentNode.ParentNode.LocalName
 $CHILD = $n.node.localname
 [PSCustomObject]@{L1=$L1; CHILD = $CHILD}
}

# ------ LEVEL 4 children 


$L4 = $x | Select-Xml "//root/*/*/*/*"

foreach($n in $L4) { 
 $L1 = $n.node.ParentNode.ParentNode.ParentNode.LocalName
 $CHILD = $n.node.localname
 [PSCustomObject]@{L1=$L1; CHILD = $CHILD}

}

Also adding sql xquery version. It still requires knowledge of the structure and only do one level at the time, but it doesnt have cross apply/joins so probably will work better on huge files

select
 T.c.query('local-name(.)') as self
 ,T.c.query('local-name(..)') as parent
 ,T.c.query('local-name(../..)') as Gparent
  ,T.c.query('local-name(../../..)') as GGparent
from @x.nodes('/root/*/*/*/*') T(c)

Upvotes: 1

Andrei Odegov
Andrei Odegov

Reputation: 3429

If we are talking about hard-coded solutions to this problem, then here is my solution using SQL Server.

DECLARE @x XML = '
<root>
  <childA>
    <descendant1>
      <descendant1_1>
        <descendant1_1_1>1111111111</descendant1_1_1>
      </descendant1_1>
    </descendant1>
    <descendant2>0</descendant2>
  </childA>
  <childA>
    <descendant1>
      <descendant1_1>
        <descendant1_1_1>2222222222</descendant1_1_1>
      </descendant1_1>
    </descendant1>
    <descendant2>2</descendant2>
  </childA>
  <childB>
    <descendant1>
      <descendant1_1>
        <descendant1_1_1>2222222222</descendant1_1_1>
      </descendant1_1>
    </descendant1>
    <descendant3>0</descendant3>
  </childB>
  <childC>
    <descendant4>0</descendant4>
  </childC>
  <childC>
    <descendant4>6</descendant4>
  </childC>
</root>
';

SELECT
  x.n.value('fn:local-name(.)', 'NVARCHAR(MAX)') root,
  L2.n.value('fn:local-name(.)', 'NVARCHAR(MAX)') L2,
  L3.n.value('fn:local-name(.)', 'NVARCHAR(MAX)') L3,
  L4.n.value('fn:local-name(.)', 'NVARCHAR(MAX)') L4,
  L5.n.value('fn:local-name(.)', 'NVARCHAR(MAX)') L5
FROM @x.nodes('/*') x(n)
OUTER APPLY x.n.nodes('*') L2(n)
OUTER APPLY L2.n.nodes('*') L3(n)
OUTER APPLY L3.n.nodes('*') L4(n)
OUTER APPLY L4.n.nodes('*') L5(n);

Output

+------+--------+-------------+---------------+-----------------+
| root |   L2   |     L3      |      L4       |       L5        |
+------+--------+-------------+---------------+-----------------+
| root | childA | descendant1 | descendant1_1 | descendant1_1_1 |
| root | childA | descendant2 |               |                 |
| root | childA | descendant1 | descendant1_1 | descendant1_1_1 |
| root | childA | descendant2 |               |                 |
| root | childB | descendant1 | descendant1_1 | descendant1_1_1 |
| root | childB | descendant3 |               |                 |
| root | childC | descendant4 |               |                 |
| root | childC | descendant4 |               |                 |
+------+--------+-------------+---------------+-----------------+

Upvotes: 1

Related Questions