Reputation: 999
I have written a simple PowerShell script that reads data from a master CSV file and parses a URL column within it.
It then checks the local file system to see whether a binary file (in this case a WAV or AIFF file) exists within a subdirectory of the script root path, and if it does, writes the row's metadata, plus an added "Local_File" column, to a new CSV file.
The script works perfectly fine and executes in about 10 seconds when the master CSV contains a sample of about 2,000 rows. However, a client in Korea reports that the script slows down and eventually times out with a master CSV of roughly 60,000 rows.
A sample of the CSV I am parsing can be found here.
My gut feeling is that this is a memory issue caused by $NewCSVObject; however, since this is one of my first PS scripts, it could very well just be bad coding on my part.
function Get-ScriptDirectory {
    Split-Path -Parent $PSCommandPath
}

function Resolve-Local-CSV {
    Write-Host "Generating Local CSV File ... Please wait ..."
    $currentPath = Get-ScriptDirectory
    $csvfile = "$currentPath{0}media_local_master.csv" -f [IO.Path]::DirectorySeparatorChar
    $masterCSV = Import-Csv "$csvfile" -Header _id, active, track_title, Download_URL, artist_name, album_title, composer, duration, publisher, description, url_path, beats_per_minute, file_path_uncompressed, preview_url, genres, moods, styles, instruments, keywords, last_modified -Delimiter ','
    $NewCSVObject = @()
    $masterCSV | ForEach-Object {
        $downloadUrl = $_.Download_URL
        if ($downloadUrl -ne 'Download_URL') {
            $parsedurl = $downloadUrl.Split("/")
            $artist = $parsedurl[4]
            $album = $parsedurl[5]
            $track = $parsedurl[6]
            $albumTarget = "$currentPath{0}media{0}$artist{0}$album{0}" -f [IO.Path]::DirectorySeparatorChar
            $trackTarget = "$albumTarget$track"
            if ((Test-Path $trackTarget)) {
                $localfile = "media{0}$artist{0}$album{0}$track" -f [IO.Path]::DirectorySeparatorChar
                $_ | Add-Member -MemberType NoteProperty -Name "Local_File" -Value $localfile
                $NewCSVObject += $_
            }
        }
        $newLocalMaster = "$currentPath{0}media_local_master_final.csv" -f [IO.Path]::DirectorySeparatorChar
        $NewCSVObject | Export-Csv "$newLocalMaster" -notype
    }
    Write-Host "Finished Generating Local CSV File ..."
}
Resolve-Local-CSV
Upvotes: 0
Views: 199
Reputation: 2935
It's worth noting that while there are efficiencies to be gained from code changes in your script, they might be minimal if the customer is running into a hardware limitation: if the machine running your code is paging, then the massive slow-down is most likely the OS paging to and from disk.
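As a quick sanity check (a sketch, not part of the original script), you could have the customer sample the process working set before and after the heavy section to see whether memory growth lines up with the slow-down:
# Sample this process's working set before and after the CSV processing
$before = (Get-Process -Id $PID).WorkingSet64 / 1MB
Resolve-Local-CSV   # the function from the question
$after = (Get-Process -Id $PID).WorkingSet64 / 1MB
Write-Host ("Working set grew by {0:N0} MB" -f ($after - $before))
If that number approaches the machine's available RAM, no amount of loop tuning will fully fix the problem.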
From a code efficiency standpoint, I've been in the position of doing a large number of high-volume PowerShell operations (hundreds of millions of files, DB records, LDAP objects, etc.). As such, I've done a fair amount of baselining the relative efficiencies of different ways to perform the same operation (speed, memory footprint, searchability, etc.). I've run these baselines on several hardware platforms of varying strength, and while the numbers shift with the hardware, the conclusions are always the same (see below). Here is a small example:
Loop Efficiencies - 10 samples of loops of 100,000
Code:
## Basic Loop efficiencies
1..10 | % {
    [pscustomobject]@{
        ForeachObject = "{0:N2}" -f ((Measure-Command { 1..100000 | ForEach-Object {} }).TotalSeconds)
        Foreach       = "{0:N2}" -f ((Measure-Command { $arr = (1..100000); foreach ($i in $arr) {} }).TotalSeconds)
        For           = "{0:N2}" -f ((Measure-Command { $arr = (1..100000); for ($i = 1; $i -lt $arr.Count; $i++) {} }).TotalSeconds)
        DoUntil       = "{0:N2}" -f ((Measure-Command { $i = 1; do { $i++ } until ($i -eq 99999) }).TotalSeconds)
        DoWhile       = "{0:N2}" -f ((Measure-Command { $i = 1; do { $i++ } while ($i -lt 100000) }).TotalSeconds)
    }
} | Format-Table
Output (in total seconds):
ForeachObject Foreach For DoUntil DoWhile
------------- ------- --- ------- -------
0.64 0.08 0.29 0.17 0.18
0.66 0.07 0.33 0.16 0.17
0.65 0.08 0.29 0.17 0.17
0.72 0.07 0.31 0.19 0.16
0.64 0.07 0.28 0.16 0.16
0.62 0.07 0.30 0.19 0.19
0.73 0.07 0.31 0.16 0.18
0.73 0.08 0.33 0.17 0.17
0.64 0.08 0.32 0.16 0.18
0.65 0.07 0.29 0.16 0.16
Array Creation - 10 samples of adding 10K, 25K, 50K
Code:
## Basic Array Appending
1..10 | % {
    [pscustomobject]@{
        Array_PlusEquals_10K     = "{0:N2}" -f ((Measure-Command { $out = @(); $arr = (1..10000); foreach ($i in $arr) { $out += $i } }).TotalSeconds)
        ArrayList_Add_Method_10K = "{0:N2}" -f ((Measure-Command { $out = [System.Collections.ArrayList]::new(); $arr = (1..10000); foreach ($i in $arr) { [void]$out.Add($i) } }).TotalSeconds)
        Array_Assignment_10K     = "{0:N2}" -f ((Measure-Command { $arr = (1..10000); $out = foreach ($i in $arr) { $i } }).TotalSeconds)
        Array_PlusEquals_25K     = "{0:N2}" -f ((Measure-Command { $out = @(); $arr = (1..25000); foreach ($i in $arr) { $out += $i } }).TotalSeconds)
        ArrayList_Add_Method_25K = "{0:N2}" -f ((Measure-Command { $out = [System.Collections.ArrayList]::new(); $arr = (1..25000); foreach ($i in $arr) { [void]$out.Add($i) } }).TotalSeconds)
        Array_Assignment_25K     = "{0:N2}" -f ((Measure-Command { $arr = (1..25000); $out = foreach ($i in $arr) { $i } }).TotalSeconds)
        Array_PlusEquals_50K     = "{0:N2}" -f ((Measure-Command { $out = @(); $arr = (1..50000); foreach ($i in $arr) { $out += $i } }).TotalSeconds)
        ArrayList_Add_Method_50K = "{0:N2}" -f ((Measure-Command { $out = [System.Collections.ArrayList]::new(); $arr = (1..50000); foreach ($i in $arr) { [void]$out.Add($i) } }).TotalSeconds)
        Array_Assignment_50K     = "{0:N2}" -f ((Measure-Command { $arr = (1..50000); $out = foreach ($i in $arr) { $i } }).TotalSeconds)
    }
} | Format-Table
Output (in total seconds):
Array_PlusEquals_10K ArrayList_Add_Method_10K Array_Assignment_10K Array_PlusEquals_25K ArrayList_Add_Method_25K Array_Assignment_25K Array_PlusEquals_50K ArrayList_Add_Method_50K Array_Assignment_50K
-------------------- ------------------------ -------------------- -------------------- ------------------------ -------------------- -------------------- ------------------------ --------------------
3.32 0.04 0.01 20.64 0.05 0.04 79.97 0.10 0.07
3.15 0.02 0.01 20.68 0.05 0.03 80.90 0.11 0.07
3.23 0.05 0.01 21.30 0.04 0.04 82.65 0.09 0.07
3.16 0.02 0.01 21.46 0.05 0.03 80.05 0.09 0.07
3.16 0.02 0.01 21.33 0.04 0.04 87.93 0.10 0.07
3.04 0.02 0.01 19.65 0.05 0.03 80.12 0.09 0.07
3.05 0.02 0.01 19.81 0.05 0.03 81.22 0.11 0.07
3.04 0.02 0.01 19.60 0.05 0.04 79.62 0.12 0.07
3.16 0.02 0.01 20.69 0.05 0.04 82.85 0.10 0.07
3.31 0.02 0.01 21.95 0.05 0.03 81.19 0.11 0.07
Conclusions:
Loops: the foreach statement is consistently the fastest of the constructs above; the ForEach-Object pipeline is roughly an order of magnitude slower, with for, do/until, and do/while in between.
Adding data to Arrays: += becomes dramatically more expensive as the array grows (compare the 10K, 25K, and 50K columns), while ArrayList.Add() and direct assignment of loop output stay fast at every size; direct assignment is the cheapest of all.
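Applied to the script in the question, those two conclusions suggest replacing the pipeline-plus-+= pattern with a foreach statement whose output is captured by direct assignment. A sketch, reusing the question's variable names and assuming the same URL layout:
# Sketch: foreach statement + direct assignment, no += and no per-row Export-Csv
$sep = [IO.Path]::DirectorySeparatorChar
$NewCSVObject = foreach ($row in $masterCSV) {
    if ($row.Download_URL -ne 'Download_URL') {
        $parsedurl = $row.Download_URL.Split('/')
        $artist = $parsedurl[4]; $album = $parsedurl[5]; $track = $parsedurl[6]
        $trackTarget = "$currentPath{0}media{0}$artist{0}$album{0}$track" -f $sep
        if (Test-Path $trackTarget) {
            $row | Add-Member -MemberType NoteProperty -Name 'Local_File' -Value ("media{0}$artist{0}$album{0}$track" -f $sep)
            $row  # emitted here, collected by the assignment above
        }
    }
}
$NewCSVObject | Export-Csv $newLocalMaster -notype  # written once, after the loop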
Upvotes: 1
Reputation: 25001
If it is a memory issue, then the syntax $NewCSVObject += $_ is doing you no favors at all. I would make at least the following modifications.
# Change the following
$NewCSVObject += $_                                     # Replace this line
$_                                                      # Replacement code

$masterCSV | ForEach-Object {                           # Replace this line
$NewCSVObject = $masterCSV | ForEach-Object {           # Replacement code

$NewCSVObject | Export-Csv "$newLocalMaster" -notype    # Move this line outside of the ForEach-Object {} script block
Explanation:
Using += to add elements to an array creates a new copy of the current array with all of its existing elements and then appends the additional items. As the array grows, every add becomes increasingly costly.
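If you genuinely need to add items one at a time, a resizable collection avoids that copy-on-add cost. A minimal sketch using the .NET generic list ($collection here is a stand-in for whatever you are iterating):
# Grows in amortized O(1) per add; no full-array copy like +=
$list = [System.Collections.Generic.List[object]]::new()
foreach ($item in $collection) {
    $list.Add($item)
}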
When using a foreach loop or a ForEach-Object {} script block, any output produced within those constructs can be captured by simply assigning the loop or command to a variable. The variable will be a collection if there is more than one output. Below is sample syntax to illustrate the concept.
$variable = foreach ($item in $Collection) {
    $item  # $item is output but stored in $variable instead of being sent to stdout/console
}

$variable = $collection | ForEach-Object {
    $_     # the current pipeline object is output but stored in $variable instead of being sent to stdout/console
}
You only need to write to your CSV file once, at the end of the loop, provided other system constraints don't prevent you from doing so. If you do output to CSV continuously, you need to use the -Append switch on Export-Csv to prevent overwriting previous data.
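For instance, a sketch using the variable names from the question:
# Writing inside the loop: -Append keeps each write from clobbering the previous one
$_ | Export-Csv "$newLocalMaster" -notype -Append

# Writing once after the loop is normally cheaper: one file open, one write
$NewCSVObject | Export-Csv "$newLocalMaster" -notype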
Upvotes: 1