Reputation: 35
I have a PowerShell program that reads a file, looks up accounts in another table, and then updates totals and writes out a new file with the extra information. It actually passes the file twice: once to accumulate totals, and then again to build the output file. (Whether it's a table pass or a file pass, I cannot sort the data, because it has to be kept in its original order so that transactions stay in balance.)
I've tried loading the file into an array and passing that twice, and I've tried reading the file twice without building the array. Reading the file twice is orders of magnitude faster, and that's not even counting how long it takes to load the array; this is just my loop time.
When I run with a file that is about 12 MB with 84,000 rows, it runs fairly quickly; the whole thing takes about 5 minutes (reading the file twice, not using the array).
When I process a larger file (1.2 GB with 7 million records), it takes far longer. The array version estimates it will take over 2 days; reading the file twice takes about 8 hours.
I know 7 million records is a lot, but 8 hours is a REALLY long time on a Windows server with 256 GB of RAM and way more processors than PowerShell can even use. (Even the 5 minutes to process 84,000 records seems like an eternity.)
When I was building the table, PowerShell would consume over 30 GB of RAM. It seems that when PowerShell has a large memory footprint, the performance drops drastically.
I've read a lot about PowerShell performance. I am not building strings with +=, and I'm not piping lots of things into other things. This really is very basic: read a CSV into an array, then read a file, and for each record in the file scan the array; when an item from the file is found in the array, do some simple calculations and move on. (The array isn't sorted because I actually search on two different fields, and there are duplicate values in both, so I have to scan the whole thing each time; but the array has fewer than 400 records in it.)
So when I do the math, processing 7 million records, each scanning a 400-record array, works out to roughly 7,000,000 × 400 = 2.8 billion inner-loop iterations. That sounds like a lot, and it is, but still: 8 hours on a high-end Windows server? Is this really how slow PowerShell is?
Thank you to any and all who have read this far and have suggestions or advice. -Todd
EDIT: Here is some of the code. The table that is read into memory is loaded with this command:
$ICList = Import-Csv -Path $Path2 -encoding Default
It has 31 sets of buckets that are used to accumulate totals for each day of the month (two totals and one percentage per day).
Here are the headers and a sample record:
ICKey,ICAC,OffsetAC,OffsetBank,Type,Last,Total,offtotal,pct,Total01,offtotal01,pct01,Total02,offtotal02,pct02,Total03,offtotal03,pct03,Total04,offtotal04,pct04,Total05,offtotal05,pct05,Total06,offtotal06,pct06,Total07,offtotal07,pct07,Total08,offtotal08,pct08,Total09,offtotal09,pct09,Total10,offtotal10,pct10,Total11,offtotal11,pct11,Total12,offtotal12,pct12,Total13,offtotal13,pct13,Total14,offtotal14,pct14,Total15,offtotal15,pct15,Total16,offtotal16,pct16,Total17,offtotal17,pct17,Total18,offtotal18,pct18,Total19,offtotal19,pct19,Total20,offtotal20,pct20,Total21,offtotal21,pct21,Total22,offtotal22,pct22,Total23,offtotal23,pct23,Total24,offtotal24,pct24,Total25,offtotal25,pct25,Total26,offtotal26,pct26,Total27,offtotal27,pct27,Total28,offtotal28,pct28,Total29,offtotal29,pct29,Total30,offtotal30,pct30,Total31,offtotal31,pct31
123456XX0012,123456,987654,XX0012,12m,no,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
The file that is read in the next loop looks like this:
001,Company_Reference_ID,XX0012,USD,ACTUALS,2020-12-31,Conversion,,,,00000003,,,000000005376026.81,,,,,,123456,00100217,,,
The loop that accumulates the totals looks like this:
$reader = [System.IO.File]::OpenText($path)
$line = $reader.ReadLine()
try {
    while ($null -ne ($line = $reader.ReadLine())) {
        $item = $line.Split(",")
        $thisDate = $item[5].Substring(8, 2) # AccountingDate
        $ttot = 'Total' + $thisDate
        $toff = 'Offtotal' + $thisDate
        foreach ($item2 in $ICList) {
            $itco = $item[2]   # CompanyReferenceID
            $itkey = $item[19] # ExternalCode_Account
            if ($item2.Type -eq 'm2m' -and $itco -ne 'LE0001') { $itco = 'LE0004' }
            if ($itkey -eq '109495') { $itkey = '109479' }
            if ($itkey -eq '172574') { $itkey = '172697' }
            if ($itkey -eq '172718') { $itkey = '172590' }
            $thisrecord = $itkey + $itco
            if ($thisrecord -eq $item2.ICKey) {
                $item2.$ttot = [Double]$item2.$ttot + [Double]$item[14] - [Double]$item[13] # .CreditAmount - .DebitAmount
            }
            if ($itkey -eq $item2.offsetac) {
                $item2.$toff = [Double]$item2.$toff + [Double]$item[14] - [Double]$item[13] # .CreditAmount - .DebitAmount
            }
        }
        $count++
    }
}
catch {
    Write-Host $_.Exception.Message
}
finally {
    $reader.Close()
}
The above loop takes the bulk of the time.
I then have a little loop that spins through the $ICList table to calculate percentages:
foreach ($item2 in $ICList) {
    foreach ($day in $days) {
        $ttot = 'Total' + '{0:d2}' -f $day
        $toff = 'Offtotal' + '{0:d2}' -f $day
        $tpct = 'Pct' + '{0:d2}' -f $day
        $item2.$tpct = ([double]$item2.$ttot) / ([double]$item2.$toff)
    }
}
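(Note: since both operands are cast to [double], a day with a zero offtotal produces Infinity, or NaN for 0/0, rather than throwing an error. If that ever matters, a minimal guard would look like this sketch:)
# Sketch only: skip days with no offset activity instead of storing Infinity/NaN.
if ([double]$item2.$toff -ne 0) {
    $item2.$tpct = ([double]$item2.$ttot) / ([double]$item2.$toff)
}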
And then I read the file again and write it out, adding extra records to split a single amount across multiple companies and appending a new column with the company at the end. This part actually runs faster than the first loop, even though it's doing a lot more processing.
$first = 1
$count = 1
$reader = [System.IO.File]::OpenText($path)
$outputFileName = [io.path]::ChangeExtension($path, "txt")
try {
    $outputFileName = [io.path]::GetFileNameWithoutExtension($path) + '-' + $date + ".fix.csv"
    $outputFileName = [io.path]::Combine([io.path]::GetDirectoryName($path), $outputFileName)
    $writer = [System.IO.StreamWriter] $outputFileName
    $line = $reader.ReadLine()
    $writer.WriteLine($line)
    while ($null -ne ($line = $reader.ReadLine())) {
        $item = $line.Split(",")
        if ($first -eq 1) {
            $lastjourn = $item[0] # .JournalKey
            $first = 999
        }
        if ($lastjourn -ne $item[0]) { # .JournalKey
            $count = 1
        }
        $offsetbank = $null
        $chgCR = $item[14] # .CreditAmount
        $chgDB = $item[13] # .DebitAmount
        foreach ($item2 in $ICList) {
            $itkey = $item[19] # .ExternalCode_Account
            if ($itkey -eq $item2.offsetac) {
                $newCR = $null
                $newDB = $null
                $offsetbank = $item2.offsetbank
                $tpct = 'Pct' + $item[5].Substring(8, 2) # AccountingDate
                if ($item2.$tpct -ne 1) {
                    if ('-' + $item[14] + '-' -ne '--') { # .CreditAmount is non-blank
                        $newCR = [math]::Round([double]$item[14] * $item2.$tpct, 2)
                        if ($item2.last -eq 'no') {
                            $chgCR = [math]::Round([double]$chgCR - [double]$newCR, 2)
                        }
                    }
                    if ('-' + $item[13] + '-' -ne '--') { # .DebitAmount is non-blank
                        $thispct = $item2.$tpct * -1
                        $newDB = [math]::Round([double]$item[13] * $thispct, 2)
                        if ($item2.last -eq 'no') {
                            $chgDB = [math]::Round([double]$chgDB - [double]$newDB, 2)
                        }
                    }
                    if ($item[19] -eq '202884' -or $item[19] -eq '103886') {
                    }
                    if (('-' + $newDB + '-' -ne '--' -and $newDB -ne 0) -or ('-' + $newCR + '-' -ne '--' -and $newCR -ne 0)) {
                        if ($item2.last -eq 'yes') {
                            $item[14] = [double]$chgCR # .CreditAmount
                            $item[13] = [double]$chgDB # .DebitAmount
                            $offsetbank = $item2.offsetbank
                        }
                        else {
                            $tmpcr = [double]$newCR
                            $tmpdb = [double]$newDB
                            $xyzd = 0
                            $xyzc = 0
                            if ($tmpcr -eq 0) {
                                $xyzd = [double]$tmpdb.ToString("0.00")
                                $writer.WriteLine("{0},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10},{11},{12},{13},{14},{15},{16},{17},{18},{19},{20},{21},{22},{23}",
                                    $item[0], $item[1], $item[2], $item[3], $item[4], $item[5], $item[6], $item[7], $item[8], $item[9],
                                    $count, $item[11], $item[12], [math]::Abs($xyzd).ToString("0.00"), $null, $item[15], $item[16], $item[17], $item[18], $item[19],
                                    $item[20], $item[21], 'INTERCOMPANY_AFFILIATE_ID_EXTRA', $item2.offsetbank)
                            }
                            if ($tmpdb -eq 0) {
                                $xyzc = [double]$tmpcr.ToString("0.00")
                                $writer.WriteLine("{0},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10},{11},{12},{13},{14},{15},{16},{17},{18},{19},{20},{21},{22},{23}",
                                    $item[0], $item[1], $item[2], $item[3], $item[4], $item[5], $item[6], $item[7], $item[8], $item[9],
                                    $count, $item[11], $item[12], $null, [math]::Abs($xyzc).ToString("0.00"), $item[15], $item[16], $item[17], $item[18], $item[19],
                                    $item[20], $item[21], 'INTERCOMPANY_AFFILIATE_ID_EXTRA', $item2.offsetbank)
                            }
                            $count++
                        }
                    }
                    else {
                        if ($item2.last -eq 'yes') {
                            $item[14] = [double]$chgCR # .CreditAmount
                            $item[13] = [double]$chgDB # .DebitAmount
                            $offsetbank = $item2.offsetbank
                        }
                    }
                }
            }
        }
        $tmpcr = [double]$item[14]
        $tmpdb = [double]$item[13]
        $xyzd = 0
        $xyzc = 0
        if ($tmpcr -eq 0) {
            $xyzd = [double]$tmpdb.ToString("0.00")
            $writer.WriteLine("{0},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10},{11},{12},{13},{14},{15},{16},{17},{18},{19},{20},{21},{22},{23}",
                $item[0], $item[1], $item[2], $item[3], $item[4], $item[5], $item[6], $item[7], $item[8], $item[9],
                $count, $item[11], $item[12], $xyzd.ToString("0.00"), $null, $item[15], $item[16], $item[17], $item[18], $item[19],
                $item[20], $item[21], 'INTERCOMPANY_AFFILIATE_ID', $offsetbank)
        }
        if ($tmpdb -eq 0) {
            $xyzc = [double]$tmpcr.ToString("0.00")
            $writer.WriteLine("{0},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10},{11},{12},{13},{14},{15},{16},{17},{18},{19},{20},{21},{22},{23}",
                $item[0], $item[1], $item[2], $item[3], $item[4], $item[5], $item[6], $item[7], $item[8], $item[9],
                $count, $item[11], $item[12], $null, $xyzc.ToString("0.00"), $item[15], $item[16], $item[17], $item[18], $item[19],
                $item[20], $item[21], 'INTERCOMPANY_AFFILIATE_ID', $offsetbank)
        }
        if ($tmpdb -ne 0 -and $tmpcr -ne 0) {
            $xyzc = [double]$tmpcr.ToString("0.00")
            $writer.WriteLine("{0},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10},{11},{12},{13},{14},{15},{16},{17},{18},{19},{20},{21},{22},{23}",
                $item[0], $item[1], $item[2], $item[3], $item[4], $item[5], $item[6], $item[7], $item[8], $item[9],
                $count, $item[11], $item[12], $null, $xyzc.ToString("0.00"), $item[15], $item[16], $item[17], $item[18], $item[19],
                $item[20], $item[21], 'INTERCOMPANY_AFFILIATE_ID', $offsetbank)
        }
        $count = $count + 1
        $lastjourn = $item[0]
    }
}
catch {
    Write-Log -Message $_.Exception.Message -Path $scriptlog -Colorf Black -Colorb Yellow
}
finally {
    $reader.Close()
    $writer.Close()
}
}
catch {
Write-Log -Message $_.Exception.Message -Path $scriptlog -Colorf Black -Colorb Yellow
}
finally {
$reader.Close()
$writer.Close()
}
Here is a log output showing timestamps for each section.
2022-06-13 13:47:01 INFO: -------------------------------------------------------------------
2022-06-13 13:47:01 INFO: Program Begins
2022-06-13 13:47:01 INFO: -------------------------------------------------------------------
2022-06-13 13:47:01 INFO: Loading C:\ICAccounts.csv into internal table
2022-06-13 13:47:01 INFO: Reading from C:\JAN21.end.CSV to accumulate totals
2022-06-13 13:47:01 INFO: Counting records in C:\JAN21.end.CSV
2022-06-13 13:47:03 INFO: record count is: 84853
2022-06-13 13:47:03 INFO: Reading from C:\JAN21.end.CSV to accumulate totals
2022-06-13 13:50:34 INFO: Calculating Percentages in internal table
2022-06-13 13:50:35 INFO: Reading from C:\JAN21.end.CSV again to calculate totals...
2022-06-13 13:50:35 INFO: ...and writing records to C:\JAN21.end-06132022-134701.fix.csv
2022-06-13 13:52:12 INFO: Process Complete
2022-06-13 13:52:12 INFO: Script Elapsed Time (HH:MM:SS) 00:05:10
The first loop took ~3.5 minutes.
The second loop took ~1.5 minutes.
EDIT2: @zett42, this sounds like an intriguing idea. However, I've never mixed C# and PowerShell before. I tried to replace the string code in the first while loop with something like this, but the C# cannot see the PowerShell variables. How can I get C# and PowerShell to both see the strings and the array?
$id = Get-Random
$code = @"
using System;
namespace HelloWorld
{
    public class Program$ID
    {
        public static void Main()
        {
            itco = item[2];
            itkey = item[19];
            if (item2.Type == "m2m" && itco != "LE0001")
            {
                itco = "LE0004";
            }
            if (itkey == "109495")
            {
                itkey = "109479";
            }
            if (itkey == "172574")
            {
                itkey = "172697";
            }
            if (itkey == "172718")
            {
                itkey = "172590";
            }
            thisrecord = itkey + itco;
            if (item2.icac == "202884" || item2.icac == "103886")
            {
            }
            if (thisrecord == item2.ICKey)
            {
                ttot = (Double)ttot + (Double)item[14] - (Double)item[13];
                if (itkey == "202884" || itkey == "103886")
                {
                }
            }
            if (itkey == item2.offsetac)
            {
                toff = (Double)toff + (Double)item[14] - (Double)item[13];
            }
        }
    }
}
"@
I am calling this code with:
while ($null -ne ($line = $reader.ReadLine())) {
    Add-Type -TypeDefinition $code -Language CSharp
    iex "[HelloWorld.Program]::Main()"
}
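From what I can tell so far, Add-Type compiles a standalone .NET type, so the C# can't see PowerShell variables directly; values have to be passed in as method parameters, and the type should be compiled once, outside the loop. A minimal sketch of that pattern (the HelloWorld.Helper class and Delta method are illustrative names, not from my script):
# Compile once, before the read loop (Add-Type inside the loop recompiles per line).
Add-Type -TypeDefinition @"
using System;
namespace HelloWorld
{
    public static class Helper
    {
        // Credit minus debit, treating empty fields as zero.
        public static double Delta(string credit, string debit)
        {
            double cr = string.IsNullOrEmpty(credit) ? 0 : double.Parse(credit);
            double db = string.IsNullOrEmpty(debit) ? 0 : double.Parse(debit);
            return cr - db;
        }
    }
}
"@

# Inside the loop, pass the CSV fields in as arguments:
$delta = [HelloWorld.Helper]::Delta($item[14], $item[13])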
Thanks, again.
Upvotes: 0
Views: 1025
Reputation: 35
The run that took almost five minutes now finishes in 18 seconds!!
The 8-hour data file takes less than 14 minutes to run (and 2 minutes of that is just counting the records in the file so I can build a progress bar).
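(I never showed the counting code; it's just a raw ReadLine loop, something like this sketch, where $path is the input file and the first line is a header:)
# Sketch: count data records with a raw ReadLine loop.
$reader = [System.IO.File]::OpenText($path)
$recordCount = -1   # start at -1 so the header line isn't counted
while ($null -ne $reader.ReadLine()) { $recordCount++ }
$reader.Close()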
Before any changes:
Script Elapsed Time (HH:MM:SS) 00:04:54
After moving as much as I could outside the inner loop:
Script Elapsed Time (HH:MM:SS) 00:02:56
After using the hash table:
Script Elapsed Time (HH:MM:SS) 00:00:18
Thank you to @mistersmith; their suggestion of a hash table made me rethink the loop, and I used a hash table to tell whether the inner loop was needed at all. That had the largest impact.
Also, thank you to @mclayton. The recommendation to move work outside the inner loop by itself cut the run time roughly in half. With the hash table in place the inner loop executes far less often, so that change's impact is somewhat reduced now, but it was still an excellent suggestion and a reminder to do things as few times as you can get away with.
The code I added was:
I read the list into a hash table separate from the table I was already using. (I know I could have just used the hash table for everything, but the table is small and this required fewer changes to the code I had already written.)
$ICACList = @{}
Import-Csv -Path $Path2 | ForEach-Object { $ICACList[$_.'ICAC'] = $_ }
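One note on this: if the same ICAC value appears on more than one row, later rows overwrite earlier ones in the hash table. That's harmless here because I only use the hash table as a membership test. If you needed every matching row, a grouped variant would work (a sketch; $ICACGroups is just an illustrative name):
# Sketch: keep every row per ICAC value instead of only the last one.
$ICACGroups = Import-Csv -Path $Path2 | Group-Object -Property ICAC -AsHashTable -AsString
# $ICACGroups['123456'] is then a collection of all rows whose ICAC is 123456.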
Then I simply added an if statement in front of each of the two inner foreach loops (the same guard in both passes):
if ($ICACList[$itkey]) {
    foreach ($item2 in $ICList) {
It turns out that, with the mix of work in these files, very few records actually qualify for the inner loop, so skipping it almost every time had a huge impact.
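For anyone curious, here's roughly what the top of the first loop looks like now. This is a sketch only: the per-record work is hoisted out of the inner loop, the hash-table guard sits in front of it, and the matching and accumulation inside the foreach are unchanged from what I posted above.
while ($null -ne ($line = $reader.ReadLine())) {
    $item = $line.Split(",")
    $thisDate = $item[5].Substring(8, 2)   # AccountingDate
    $ttot = 'Total' + $thisDate
    $toff = 'Offtotal' + $thisDate
    # Hoisted out of the inner loop (per @mclayton); these depend only on $item.
    # (The $itco remap still depends on $item2.Type, so it stays inside.)
    $itkey = $item[19]                      # ExternalCode_Account
    if ($itkey -eq '109495') { $itkey = '109479' }
    if ($itkey -eq '172574') { $itkey = '172697' }
    if ($itkey -eq '172718') { $itkey = '172590' }
    # O(1) membership test (per @mistersmith): skip the 400-row scan entirely
    # for records that can't match anything in the table.
    if ($ICACList[$itkey]) {
        foreach ($item2 in $ICList) {
            # ...same matching and accumulation logic as before...
        }
    }
    $count++
}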
Thank you to everyone who responded. It really was a combination of suggestions that helped me greatly improve this script and I learned a lot in the process.
Upvotes: 3