Reputation: 41
I have a setup that contains 7 million XML files, varying in size from a few KB to multiple MB. All in all, it's about 180GB of XML files. The job I need performed is to analyze each XML file, determine whether it contains the string <ref>, and if it does not, move it out of the Chunk folder it is currently in to the Referenceless folder.
The script I have created works well enough, but it's extremely slow for my purposes. It's slated to finish analyzing all 7 million files in about 24 days, at a rate of about 3 files per second. Is there anything I can change in my script to eke out more performance?
Also, to make matters even more complicated, I do not have the correct permissions on my server box to run .PS1 files, so the script needs to be runnable from the PowerShell console as one command. I would set the permissions if I had the authorization to.
# This script will iterate through the Chunk folders, removing pages that contain no
# references and putting them into the Referenceless folder.
# Change this variable to start the program on a different chunk. This is the first
# command to be run in Windows PowerShell.
$chunknumber = 1
# This while loop is the second command to be run in Windows PowerShell. It will stop after completing Chunk 113.
while ($chunknumber -le 113) {
    # Jumps the terminal to the correct folder.
    cd C:\Wiki_Pages
    # Creates an index for the chunk being worked on.
    $items = Get-ChildItem -Path "Chunk_$chunknumber"
    echo "Chunk $chunknumber Indexed"
    # Jumps to chunk folder.
    cd C:\Wiki_Pages\Chunk_$chunknumber
    # Loops through the index. Each entry is one of the pages.
    foreach ($page in $items) {
        # Creates a variable holding the page's content.
        $content = Get-Content $page
        # If the page has a reference, then it's echoed.
        if ($content | Select-String "<ref>" -Quiet) { echo "Referenced!" }
        # If the page doesn't have a reference, it's copied to Referenceless then deleted.
        else {
            Copy-Item $page C:\Wiki_Pages\Referenceless -Force
            Remove-Item $page -Force
            echo "Moved to Referenceless!"
        }
    }
    # The chunk number is increased by one and the cycle continues.
    $chunknumber = $chunknumber + 1
}
I have very little knowledge of PowerShell; yesterday was the first time I had ever even opened the program.
Upvotes: 4
Views: 4329
Reputation: 1283
Loading the XML into a variable with the XmlDocument Load() method is also significantly faster than using Get-Content.
Measure-Command {
    $xml = [xml]''
    $xml.Load($xmlFilePath)
}

Measure-Command {
    [xml]$xml = Get-Content $xmlFilePath -ReadCount 0
}
In my measurements it's at least 4 times faster.
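Applied to the question's loop, that could look roughly like this (a sketch; testing for the node with an XPath query such as //ref is my assumption about how you'd check for the reference once the file is parsed):

foreach ($page in $items) {
    $xml = [xml]''
    $xml.Load($page.FullName)
    # Move the page out if no <ref> node exists anywhere in the document.
    if (-not $xml.SelectSingleNode('//ref')) {
        Move-Item -Path $page.FullName -Destination C:\Wiki_Pages\Referenceless -Force
    }
}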
Upvotes: 2
Reputation: 2621
You will want to add the -ReadCount 0 argument to your Get-Content commands to speed them up (it helps tremendously). I learned this tip from this great article that shows running a foreach over a whole file's contents is faster than trying to parse it through a pipeline.
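In the question's loop that is a one-flag change (a sketch; everything else stays as in the original script):

foreach ($page in $items) {
    # -ReadCount 0 returns the file's lines in one chunk instead of streaming them one at a time.
    $content = Get-Content $page -ReadCount 0
    if ($content | Select-String "<ref>" -Quiet) { echo "Referenced!" }
    # ... else branch unchanged from the question's script.
}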
Also, you can use Set-ExecutionPolicy Bypass -Scope Process in order to run scripts in your current PowerShell session, without needing extra permissions!
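For example, running this once at the start of the session lets the rest of the work live in a normal script file (the .ps1 file name below is just a placeholder for wherever you save the script):

Set-ExecutionPolicy Bypass -Scope Process
.\Find-Referenceless.ps1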
Upvotes: 4
Reputation: 1786
The PowerShell pipeline can be markedly slower than the equivalent pipeline in the classic command prompt (cmd.exe).
PowerShell: pipeline performance
In this article a performance test is performed between two equivalent commands executed in PowerShell and in a classic Windows command prompt.
PS> grep [0-9] numbers.txt | wc -l > $null
CMD> cmd /c "grep [0-9] numbers.txt | wc -l > nul"
Here's a sample of its output.
PS C:\temp> 1..5 | % { .\perf.ps1 ([Math]::Pow(10, $_)) }
10 iterations
30 ms ( 0 lines / ms) grep in PS
15 ms ( 1 lines / ms) grep in cmd.exe
100 iterations
28 ms ( 4 lines / ms) grep in PS
12 ms ( 8 lines / ms) grep in cmd.exe
1000 iterations
147 ms ( 7 lines / ms) grep in PS
11 ms ( 89 lines / ms) grep in cmd.exe
10000 iterations
1347 ms ( 7 lines / ms) grep in PS
13 ms ( 786 lines / ms) grep in cmd.exe
100000 iterations
13410 ms ( 7 lines / ms) grep in PS
22 ms (4580 lines / ms) grep in cmd.exe
EDIT: The original answer to this question mentioned pipeline performance along with some other suggestions. To keep this post succinct I've removed the other suggestions that didn't actually have anything to do with pipeline performance.
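For the question's scenario, one simple way to keep the file contents out of the pipeline is to let Select-String read the file itself instead of piping Get-Content into it (a sketch, not part of the original benchmark):

foreach ($page in $items) {
    # Select-String opens the file directly; -Quiet returns $true as soon as a match is found.
    if (Select-String -Path $page.FullName -Pattern '<ref>' -Quiet) { echo "Referenced!" }
    else { Move-Item -Path $page.FullName -Destination C:\Wiki_Pages\Referenceless -Force }
}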
Upvotes: 2
Reputation: 7499
I would experiment with parsing 5 files at once using the Start-Job cmdlet. There are many excellent articles on PowerShell Jobs. If for some reason that doesn't help, and you're experiencing I/O or actual resource bottlenecks, you could even use Start-Job and WinRM to spin up workers on other machines.
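A minimal sketch of the five-at-a-time idea (the folder paths follow the question's layout; the batching details are assumptions, not tested at this scale):

$files = @(Get-ChildItem -Path C:\Wiki_Pages\Chunk_1 -File)
for ($i = 0; $i -lt $files.Count; $i += 5) {
    # Start one background job per file in the current batch of 5, then wait for the batch.
    $jobs = foreach ($file in $files[$i..($i + 4)]) {
        if ($file) {
            Start-Job -ScriptBlock {
                param($path)
                if (-not (Select-String -Path $path -Pattern '<ref>' -Quiet)) {
                    Move-Item -Path $path -Destination C:\Wiki_Pages\Referenceless -Force
                }
            } -ArgumentList $file.FullName
        }
    }
    $jobs | Wait-Job | Remove-Job
}

Job startup is not free, so it's worth measuring whether handing each job a larger slice of files works better than five single-file jobs.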
Upvotes: 0
Reputation: 28174
Before you start optimizing, you need to determine exactly where you need to optimize. Are you I/O bound (how long it takes to read each file)? Memory bound (likely not)? CPU bound (time to search the content)?
You say these are XML files; have you tested reading the files into an XML object (instead of plain text) and locating the <ref> node via XPath? You would then have:
$content = [xml](Get-Content $page)
# If the page has a reference, then it's echoed.
if ($content.SelectSingleNode("//ref")) { echo "Referenced!" }
If you have CPU, memory & I/O resources to spare, you may see some improvement by searching multiple files in parallel. See this discussion on running several jobs in parallel. Obviously you can't run a large number simultaneously, but with some testing you can find the sweet spot (probably in the neighborhood of 3-5). Everything inside foreach ($page in $items){ would be the scriptblock for the job.
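Putting that together with the XPath check above, a rough sketch could look like this (the throttle of 4 concurrent jobs and the polling interval are assumptions to tune, not measured values):

$maxJobs = 4
foreach ($page in $items) {
    # Wait for a free slot so no more than $maxJobs jobs run at the same time.
    while ((Get-Job -State Running).Count -ge $maxJobs) { Start-Sleep -Milliseconds 100 }
    Start-Job -ScriptBlock {
        param($path)
        $content = [xml](Get-Content $path)
        if (-not $content.SelectSingleNode("//ref")) {
            Copy-Item $path C:\Wiki_Pages\Referenceless -Force
            Remove-Item $path -Force
        }
    } -ArgumentList $page.FullName
}
Get-Job | Wait-Job | Remove-Job

Per-file jobs carry noticeable startup overhead, so it may pay to hand each job a batch of pages rather than a single one.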
Upvotes: 0