Reputation: 569
I have a scenario where I need to edit very large files and the end result is rather simple, but achieving it has become a bit of a drag on my computer and memory. Due to downstream systems, I cannot load a duplicate file (according to a computed hash) twice. My workaround has been to move the first actual line/record to the end of the file without changing anything else. This method (shown below in Method 1) works great for files that are small enough, but now I have files that are extremely large. So I began working on Method 2 below, but I haven't quite figured out how to stream lines from an input file into an output file.
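(Just to illustrate why this workaround satisfies the dedup check at all: moving a single record is enough to change the computed hash. A throwaway example, not from my actual scripts:)
# Illustration only: same records, different order, different hash
Set-Content a.csv -Value 'header','record 1','record 2'
Set-Content b.csv -Value 'header','record 2','record 1'
(Get-FileHash a.csv).Hash -eq (Get-FileHash b.csv).Hash   # False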
#Method 1
$Prefix = Read-Host -Prompt "What do you want to use as the prefix for the updated file names? (The number 1 is the default)"
If ([string]::IsNullOrEmpty($Prefix)){$Prefix = '1_'}
If($Prefix[-1] -ne '_'){$Prefix = "$($Prefix)_"}
$files = (Get-ChildItem -LiteralPath $PWD -Filter '*.csv' -File)
Foreach ($inputFile in $files){
    $A = Get-Content $inputFile   # reads the entire file into memory at once
    $Header = $A[0]               # header line
    $Data = $A[2..($A.Count-1)]   # everything after the first record
    $Footer = $A[1]               # the first actual record, to be appended at the end
    $Header, $Data, $Footer | Add-Content -LiteralPath "$($inputFile.DirectoryName)\$($Prefix)$($inputFile.BaseName).csv"
}
#Work-in-progress Method 2
$inputFile = "\Input.csv"
$outputFile = "\Output.csv"
#Create StringReader
$sr = [System.IO.StringReader]::New((Get-Content $inputFile -Raw))
#Create StringWriter
$sw = [System.IO.StringWriter]::New()
#Write the Header
$sw.Write($sr.ReadLine())
#Get the first actual record as a string
$lastLine = $sr.ReadLine()
#Write the rest of the lines
$sw.Write($sr.ReadToEnd())
#Add the final line
$sw.Write($lastLine)
#Write everything to the outputFile
[System.IO.File]::WriteAllText($outputFile, $sw.ToString())
Get-Content:
Line |
5 | $sr = [System.IO.StringReader]::New((Get-Content $inputFile -Raw))
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~
| Insufficient memory to continue the execution of the program.
MethodInvocationException:
Line |
5 | $sr = [System.IO.StringReader]::New((Get-Content $inputFile -Raw))
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| Exception calling ".ctor" with "1" argument(s): "Value cannot be null. (Parameter 's')"
I'm having a bit of trouble comprehending the difference between a StringWriter itself and a StringBuilder - for example, why would I choose to use the StringWriter as I have rather than simply work with a StringBuilder directly? Most importantly though, the current iteration of Method 2 requires more memory than my system has, and it isn't actually streaming the characters/lines/data from the input file to the output file. Are there built-in methods for checking memory that I'm overlooking, or is there simply a better way to achieve my goal?
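(Regarding the StringWriter vs. StringBuilder confusion above, my rough understanding from poking around - treat this as an illustration rather than gospel - is that a StringWriter is just a TextWriter facade over an internal StringBuilder:)
# Quick check: writes to the StringWriter land in the underlying StringBuilder
$sb = [System.Text.StringBuilder]::new()
$sw = [System.IO.StringWriter]::new($sb)
$sw.WriteLine('header')
$sb.ToString()                                          # "header" plus a newline
[object]::ReferenceEquals($sw.GetStringBuilder(), $sb)  # True - same underlying builder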
Upvotes: 1
Views: 1316
Reputation: 2050
If you need to go a bit faster on many larger files, and you know for certain that your CSV data is clean, you could also use the binary IO.FileStream.
Basically, the example below takes a core sample off the top of the file and scans it for the header and footer lines. It then writes the header, dumps the remainder of the sample, uses the stream class's CopyTo (rather than a PowerShell while loop) to get the speed boost, and finally writes the footer.
#assuming that .csv file lines end with CRLF, i.e. bytes 13,10
# things go terribly wrong if this is not true
[byte]$ByteCR = 13 # 0D
[byte]$ByteLF = 10 # 0A
function Find-OffsetPastNextEOL{
    param([System.Collections.IEnumerator]$enu)
    $QuotedState = $false #ToDo: csv files can possibly have multiple lines per record
    $CRLF_found = $false
    $count = 0
    while($enu.MoveNext() -and !$CRLF_found){ #expected to take far fewer iterations than the number of lines in the file
        $count++
        if($enu.Current -eq $ByteCR -and $enu.MoveNext()){
            $count++
            $CRLF_found = $enu.Current -eq $ByteLF
        }
    }
    return $count
}
function Test-EndOfFileHasEOL{
    param([System.IO.FileStream]$read)
    $null = $read.Seek(-2,'End')
    return $read.ReadByte() -eq $ByteCR -and $read.ReadByte() -eq $ByteLF
}
$BufferSize = 100mb
$SampleSize = 1mb #ideally something just big enough to make sure you get the first two lines of every file
$SampleWithHeadAndFoot = new-object byte[] $SampleSize
Foreach ($inputFile in $files){
    try{
        #[IO.FileStream]::new(($IWantThis=$FullPath),($InorderTo='Open'),($IWill='Read'),($AtTheSameTimeOthersCan='Read'),($BytesAtOnce=$BufferSize))
        $ReadFilePath = $inputFile.FullName
        $read = [IO.FileStream]::new($ReadFilePath ,'Open' ,'Read' ,'Read',$BufferSize)
        $WriteFilePath = $ReadFilePath -Replace ($inputFile.Name+'$'),"$Prefix`$0"
        $write = [IO.FileStream]::new($WriteFilePath,'Append','Write','None',$BufferSize)
        $TotalBytesSampled = $read.Read($SampleWithHeadAndFoot, 0, $SampleSize)
        #ToDo: check for BOM or other indicators that the csv data is one-byte ASCII or UTF8
        $enu = $SampleWithHeadAndFoot.GetEnumerator()
        $HeaderLength = 0 + (Find-OffsetPastNextEOL $enu)
        $FooterLength = 1 + (Find-OffsetPastNextEOL $enu)
        $DataStartPosition = $HeaderLength + $FooterLength
        $OversampleLength = $TotalBytesSampled - ($HeaderLength + $FooterLength)
        $write.Write($SampleWithHeadAndFoot,0,$HeaderLength) #write the header from the sample
        if($DataStartPosition -lt $TotalBytesSampled - 1){ #flush the sample data after the first record
            $write.Write($SampleWithHeadAndFoot,$DataStartPosition,$OversampleLength)
        }
        $read.CopyTo($write,$BufferSize) #flush the rest of the data still in the read stream
        if(!(Test-EndOfFileHasEOL $read)){ #inject CRLF if EOF didn't already have one
            $write.WriteByte($ByteCR)
            $write.WriteByte($ByteLF)
        }
        $write.Write($SampleWithHeadAndFoot,$HeaderLength,$FooterLength) #write the first record as the footer
    }finally{
        $read.Dispose()
        $write.Dispose()
    }
}
I manually tested with the setup below, but you'll likely want to make a few modifications for production code to make it a bit more robust against CSV data found in the wild.
PS> sc xx.csv -Value @"
this is a header`r
this was line 1`r
this was line 2`r
this was line 3`r
this was line 4`r
this was line 5
"@ -Encoding utf8 -NoNewLine
PS> $Prefix = '1_'
PS> $files = ls xx.csv
PS> $SampleSize = 40
PS> Format-Hex .\xx.csv
Path: .\xx.csv
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000 EF BB BF 74 68 69 73 20 69 73 20 61 20 68 65 61 this is a hea
00000010 64 65 72 0D 0A 74 68 69 73 20 77 61 73 20 6C 69 der..this was li
00000020 6E 65 20 31 0D 0A 74 68 69 73 20 77 61 73 20 6C ne 1..this was l
00000030 69 6E 65 20 32 0D 0A 74 68 69 73 20 77 61 73 20 ine 2..this was
00000040 6C 69 6E 65 20 33 0D 0A 74 68 69 73 20 77 61 73 line 3..this was
00000050 20 6C 69 6E 65 20 34 0D 0A 74 68 69 73 20 77 61 line 4..this wa
00000060 73 20 6C 69 6E 65 20 35 s line 5
PS> Format-Hex .\1_xx.csv
Path: .\1_xx.csv
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000 EF BB BF 74 68 69 73 20 69 73 20 61 20 68 65 61 this is a hea
00000010 64 65 72 0D 0A 74 68 69 73 20 77 61 73 20 6C 69 der..this was li
00000020 6E 65 20 32 0D 0A 74 68 69 73 20 77 61 73 20 6C ne 2..this was l
00000030 69 6E 65 20 33 0D 0A 74 68 69 73 20 77 61 73 20 ine 3..this was
00000040 6C 69 6E 65 20 34 0D 0A 74 68 69 73 20 77 61 73 line 4..this was
00000050 20 6C 69 6E 65 20 35 0D 0A 74 68 69 73 20 77 61 line 5..this wa
00000060 73 20 6C 69 6E 65 20 31 0D 0A s line 1..
Upvotes: 0
Reputation: 60110
This is how your code would look using StreamReader and StreamWriter:
Get-ChildItem -LiteralPath $PWD -Filter '*.csv' -File | ForEach-Object {
try {
$path = "$($_.DirectoryName)\$Prefix$($_.BaseName).csv"
$writer = [IO.StreamWriter] $path
$stream = $_.OpenRead()
$reader = [IO.StreamReader] $stream
$writer.WriteLine($reader.ReadLine()) # => This is header
$footer = $reader.ReadLine()
while(-not $reader.EndOfStream) {
$writer.WriteLine($reader.ReadLine())
}
$writer.WriteLine($footer)
}
finally {
$stream, $reader, $writer | ForEach-Object Dispose
}
}
This method keeps memory usage as low as possible, since only the current line (plus the held-back first record) is in memory at any time, and it is about as efficient as it gets.
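If you want a rough way to sanity-check the memory claim yourself (this is just a quick illustration, not part of the approach above), you can snapshot the working set of the current PowerShell process before and after running the loop:
$before = (Get-Process -Id $PID).WorkingSet64
# ... run the Get-ChildItem / StreamReader loop above ...
$after = (Get-Process -Id $PID).WorkingSet64
'{0:N2} MB difference' -f (($after - $before) / 1MB)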
Upvotes: 0
Reputation: 23673
The nice thing about the PowerShell pipeline is that it streams by nature, provided it is used correctly: don't assign the intermediate output to a variable or otherwise collect it all before passing it on, as that will choke the pipeline.
In your case:
$Prefix = Read-Host -Prompt "What do you want to use as the prefix for the updated file names? (The number 1 is the default)"
If ([string]::IsNullOrEmpty($Prefix)){$Prefix = '1_'}
If($Prefix[-1] -ne '_'){$Prefix = "$($Prefix)_"}
Get-ChildItem -LiteralPath $PWD -Filter '*.csv' -File |ForEach-Object {
    $inputFile = $_
    $inputFile |Import-Csv |ForEach-Object -Begin { $Index = 0 } -Process {
        if ($Index++) { $_ } else { $Footer = $_ }
    } -End { $Footer } |
    Export-Csv -LiteralPath "$($inputFile.DirectoryName)\$($Prefix)$($inputFile.BaseName).csv"
}
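To see the reordering trick in isolation (a quick sketch with dummy input, not part of the actual file handling):
1..5 |ForEach-Object -Begin { $Index = 0 } -Process {
    if ($Index++) { $_ } else { $Footer = $_ }
} -End { $Footer }
# Outputs 2 3 4 5 1: the first item is held back and emitted last,
# just as the first data record ends up as the last line of the output csv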
Upvotes: 1