David F. Severski
David F. Severski

Reputation: 493

Parsing large XML files into PowerShell objects

I'm a PowerShell and XPath beginner struggling to efficiently parse through some XML and build up an array of objects for further processing (e.g. CSV output, SQL Server load). A sample of the XML is included below as well as the code snippet that I'm currently using. In this schema, each object-array represents a single row in the desired output. I'm parsing the MetaData children to get the proper names of the columns, then building a collection of PSObjects where each object in the array represents a single row. The MetaData information is used to find the column names (PSObject properties).

This works fine for files with 10K rows or so, but bogs down horribly when run against my largest files with over 500K rows. In these cases each row is taking around 3-4 seconds to process. At 500K rows, that's a looong time to run. Is there some magic around XPath or PS variable assignment that I can use to speed this up?

The immediate need to is translate this XML into a CSV (currently performed via export-csv), but I'd prefer to have this portion of the script generate a collection of objects as I'll next be looking to either load this data into a SQL Server instance or do other processing.

Thanks for the help!

David

Sample XML

<Report>
<Data>
<Columns>
<MetaData>
<Index>0</Index>
<Name>Column1</Name>
<Index>1</Index>
<Name>Column2</Name>
<Index>2</Index>
<Name>Column3</Name>
</MetaData>
</Columns>
<Rows>
<object-array>
<string>column1 value</string>
<int>column2 value</string>
<string>column3 value</string>
</object-array>
</Rows>
</Data>
</Report>

Sample Code

#extract the column headers
[string[]]$ColumnHeaders = @()
$obj.SelectNodes("/Report/Data/Columns/MetaData") |% {$ColumnHeaders += $_.name}

$collection = @()
$rowint = 0
$rowcount = $obj.Report.Data.Rows."object-array".count

#unwind the rows
do {
    $hash=@{}

    #loop through each element in the row parent element and add it to the hash
    $columnint = 0
    $columncount = (Select-Xml -xPath "Report/Data/Rows/object-array[$rowint]/node()" $obj).count
        do {
            $hash.Add($columnheaders[$columnint], (Select-Xml -xPath "Report/Data/Rows/object-array[$rowint]/descendant::text()[$columnint]" $obj).Node.Value)
            $columnint++
        } while ($columnint -lt $columncount)


    $thisrow = New-Object PSObject -Property $hash 

    #add this new row to the collection 
    $collection += $thisrow 
    $rowint++
} while ($rowint -lt $rowcount)

Upvotes: 4

Views: 4752

Answers (1)

Shay Levy
Shay Levy

Reputation: 126722

You can get MetaData names without re-creating ColumnHeaders in each itreation:

$ColumnHeaders = $obj.Report.Data.Columns.MetaData.Name

Same applies to $collection. How the end result of your code looks like?

UPDATE: Give this a try

[xml]$obj = Get-Content test.xml

$data = $obj.Report.Data

$pso = New-Object PSObject
$pso | Add-Member NoteProperty -Name $data.Columns.MetaData.Name[0] -Value $data.Rows.'object-array'.string[0]
$pso | Add-Member NoteProperty -Name $data.Columns.MetaData.Name[1] -Value $data.Rows.'object-array'.int
$pso | Add-Member NoteProperty -Name $data.Columns.MetaData.Name[2] -Value $data.Rows.'object-array'.string[1] -PassThru

Upvotes: 1

Related Questions