Reputation: 41
I have a setup that contains 7 million XML files, varying in size from a few KB to multiple MB. All in all, it's about 180GB of XML files. The job I need performed is to analyze each XML file, determine whether it contains the string <ref>, and if it does not, move it out of the Chunk folder it is currently in to the Referenceless folder.
The script I have created works well enough, but it's extremely slow for my purposes. It's slated to finish analyzing all 7 million files in about 24 days, at a rate of about 3 files per second. Is there anything I can change in my script to eke out more performance?
Also, to make matters even more complicated, I do not have the correct permissions on my server box to run .PS1 files, so the script needs to be runnable from the PowerShell console as one command. I would set the permissions if I had the authorization to.
# This script will iterate through the Chunk folders, removing pages that contain no
# references and putting them into the Referenceless folder.
# Change this variable to start the program on a different chunk. This is the first
# command to be run in Windows PowerShell.
$chunknumber = 1
# This while loop is the second command to be run in Windows PowerShell. It will stop after completing Chunk 113.
while ($chunknumber -le 113) {
    # Jumps the terminal to the correct folder.
    cd C:\Wiki_Pages
    # Creates an index for the chunk being worked on.
    $items = Get-ChildItem -Path "Chunk_$chunknumber"
    echo "Chunk $chunknumber Indexed"
    # Jumps to chunk folder.
    cd C:\Wiki_Pages\Chunk_$chunknumber
    # Loops through the index. Each entry is one of the pages.
    foreach ($page in $items) {
        # Creates a variable holding the page's content.
        $content = Get-Content $page
        # If the page has a reference, then it's echoed.
        if ($content | Select-String "<ref>" -Quiet) { echo "Referenced!" }
        # If the page doesn't have a reference, it's copied to Referenceless then deleted.
        else {
            Copy-Item $page C:\Wiki_Pages\Referenceless -Force
            Remove-Item $page -Force
            echo "Moved to Referenceless!"
        }
    }
    # The chunk number is increased by one and the cycle continues.
    $chunknumber = $chunknumber + 1
}
I have very little knowledge of PowerShell; yesterday was the first time I had ever even opened the program.
Upvotes: 4
Views: 4329
Reputation: 1283
Loading the XML into a variable with the XmlDocument Load() method is also significantly faster than using Get-Content.
Measure-Command {
    $xml = [xml]''
    $xml.Load($xmlFilePath)
}

Measure-Command {
    [xml]$xml = Get-Content $xmlFilePath -ReadCount 0
}
In my measurements it's at least 4 times faster.
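Applied to the question's loop, that could look roughly like this (a sketch; testing for the node with an XPath query such as //ref is my assumption about how you'd check for the reference once the file is parsed):

foreach ($page in $items) {
    $xml = [xml]''
    $xml.Load($page.FullName)
    # Move the page out if no <ref> node exists anywhere in the document.
    if (-not $xml.SelectSingleNode('//ref')) {
        Move-Item -Path $page.FullName -Destination C:\Wiki_Pages\Referenceless -Force
    }
}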
Upvotes: 2
Reputation: 2621
You will want to add the -ReadCount 0 argument to your Get-Content commands to speed them up (it helps tremendously). I learned this tip from this great article that shows running a foreach over a whole file's contents is faster than trying to parse it through a pipeline.
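In the question's loop that is a one-flag change (a sketch; everything else stays as in the original script):

foreach ($page in $items) {
    # -ReadCount 0 returns the file's lines in one chunk instead of streaming them one at a time.
    $content = Get-Content $page -ReadCount 0
    if ($content | Select-String "<ref>" -Quiet) { echo "Referenced!" }
    # ... else branch unchanged from the question's script.
}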
Also, you can use Set-ExecutionPolicy Bypass -Scope Process in order to run scripts in your current PowerShell session, without needing extra permissions!
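For example, running this once at the start of the session lets the rest of the work live in a normal script file (the .ps1 file name below is just a placeholder for wherever you save the script):

Set-ExecutionPolicy Bypass -Scope Process
.\Find-Referenceless.ps1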
Upvotes: 4
Reputation: 1786
The PowerShell pipeline can be markedly slower than the equivalent pipeline in the classic command prompt (cmd.exe).
PowerShell: pipeline performance
In this article a performance test is performed between two equivalent commands executed in PowerShell and in a classic Windows command prompt.
PS> grep [0-9] numbers.txt | wc -l > $null
CMD> cmd /c "grep [0-9] numbers.txt | wc -l > nul"
Here's a sample of its output.
PS C:\temp> 1..5 | % { .\perf.ps1 ([Math]::Pow(10, $_)) }
10 iterations
30 ms ( 0 lines / ms) grep in PS
15 ms ( 1 lines / ms) grep in cmd.exe
100 iterations
28 ms ( 4 lines / ms) grep in PS
12 ms ( 8 lines / ms) grep in cmd.exe
1000 iterations
147 ms ( 7 lines / ms) grep in PS
11 ms ( 89 lines / ms) grep in cmd.exe
10000 iterations
1347 ms ( 7 lines / ms) grep in PS
13 ms ( 786 lines / ms) grep in cmd.exe
100000 iterations
13410 ms ( 7 lines / ms) grep in PS
22 ms (4580 lines / ms) grep in cmd.exe
EDIT: The original answer to this question mentioned pipeline performance along with some other suggestions. To keep this post succinct I've removed the other suggestions that didn't actually have anything to do with pipeline performance.
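For the question's scenario, one simple way to keep the file contents out of the pipeline is to let Select-String read the file itself instead of piping Get-Content into it (a sketch, not part of the original benchmark):

foreach ($page in $items) {
    # Select-String opens the file directly; -Quiet returns $true as soon as a match is found.
    if (Select-String -Path $page.FullName -Pattern '<ref>' -Quiet) { echo "Referenced!" }
    else { Move-Item -Path $page.FullName -Destination C:\Wiki_Pages\Referenceless -Force }
}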
Upvotes: 2
Reputation: 7499
I would experiment with parsing 5 files at once using the Start-Job cmdlet. There are many excellent articles on PowerShell Jobs. If for some reason that doesn't help, and you're experiencing I/O or actual resource bottlenecks, you could even use Start-Job and WinRM to spin up workers on other machines.
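A minimal sketch of the five-at-a-time idea (the folder paths follow the question's layout; the batching details are assumptions, not tested at this scale):

$files = @(Get-ChildItem -Path C:\Wiki_Pages\Chunk_1 -File)
for ($i = 0; $i -lt $files.Count; $i += 5) {
    # Start one background job per file in the current batch of 5, then wait for the batch.
    $jobs = foreach ($file in $files[$i..($i + 4)]) {
        if ($file) {
            Start-Job -ScriptBlock {
                param($path)
                if (-not (Select-String -Path $path -Pattern '<ref>' -Quiet)) {
                    Move-Item -Path $path -Destination C:\Wiki_Pages\Referenceless -Force
                }
            } -ArgumentList $file.FullName
        }
    }
    $jobs | Wait-Job | Remove-Job
}

Job startup is not free, so it's worth measuring whether handing each job a larger slice of files works better than five single-file jobs.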
Upvotes: 0
Reputation: 28174
Before you start optimizing, you need to determine exactly where you need to optimize. Are you I/O bound (how long it takes to read each file)? Memory bound (likely not)? CPU bound (time to search the content)?
You say these are XML files; have you tested reading the files into an XML object (instead of plain text) and locating the <ref> node via XPath? You would then have:
$content = [xml](Get-Content $page)
# If the page has a reference, then it's echoed.
if ($content.SelectSingleNode("//ref")) { echo "Referenced!" }
If you have CPU, memory & I/O resources to spare, you may see some improvement by searching multiple files in parallel. See this discussion on running several jobs in parallel. Obviously you can't run a large number simultaneously, but with some testing you can find the sweet spot (probably in the neighborhood of 3-5). Everything inside foreach ($page in $items){ would be the scriptblock for the job.
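Putting that together with the XPath check above, a rough sketch could look like this (the throttle of 4 concurrent jobs and the polling interval are assumptions to tune, not measured values):

$maxJobs = 4
foreach ($page in $items) {
    # Wait for a free slot so no more than $maxJobs jobs run at the same time.
    while ((Get-Job -State Running).Count -ge $maxJobs) { Start-Sleep -Milliseconds 100 }
    Start-Job -ScriptBlock {
        param($path)
        $content = [xml](Get-Content $path)
        if (-not $content.SelectSingleNode("//ref")) {
            Copy-Item $path C:\Wiki_Pages\Referenceless -Force
            Remove-Item $path -Force
        }
    } -ArgumentList $page.FullName
}
Get-Job | Wait-Job | Remove-Job

Per-file jobs carry noticeable startup overhead, so it may pay to hand each job a batch of pages rather than a single one.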
Upvotes: 0