immobile2

Reputation: 569

PowerShell I/O on large files without loading everything into memory

I have a scenario where I need to edit very large files, and the end result is rather simple, but achieving it has become a drag on my computer and memory. Due to downstream systems, I cannot load the same file (as identified by a computed hash) twice. My workaround has been to move the first actual line/record to the end of the file without changing anything else. This method (shown below as Method 1) works great for files that are small enough, but now I have files that are extremely large, so I began working on Method 2 below. However, I haven't quite figured out how to stream lines from an input file into an output file.

#Method 1
$Prefix = Read-Host -Prompt "What do you want to use as the prefix for the updated file names? (The number 1 is the default)"
If ([string]::IsNullOrEmpty($Prefix)){$Prefix = '1_'}
If($Prefix[-1] -ne '_'){$Prefix = "$($Prefix)_"}
$files = (Get-ChildItem -LiteralPath $PWD -Filter '*.csv' -File)
Foreach ($inputFile in $files){
    $A = Get-Content -LiteralPath $inputFile.FullName #loads the whole file into memory
    $Header = $A[0]               #first line stays first
    $Data   = $A[2..($A.Count-1)] #everything after the first record
    $Footer = $A[1]               #first record, moved to the end
    $Header, $Data, $Footer | Add-Content -LiteralPath "$($inputFile.DirectoryName)\$($Prefix)$($inputFile.BaseName).csv"
}
#Work-in-progress Method 2
$inputFile = "\Input.csv"
$outputFile = "\Output.csv"

#Create StringReader
$sr = [System.IO.StringReader]::New((Get-Content $inputFile -Raw))

#Create StringWriter
$sw = [System.IO.StringWriter]::New()

#Write the header (note: Write() does not append a newline; WriteLine() would)
$sw.Write($sr.ReadLine())

#Get the first actual record as a string
$lastLine = $sr.ReadLine()

#Write the rest of the lines
$sw.Write($sr.ReadToEnd())

#Add the final line
$sw.Write($lastLine)

#Write everything to the outputFile
[System.IO.File]::WriteAllText($outputFile, $sw.ToString())

Get-Content:
Line |
   5 |  $sr = [System.IO.StringReader]::New((Get-Content $inputFile -Raw))
     |                                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~
     | Insufficient memory to continue the execution of the program.
MethodInvocationException:
Line |
   5 |  $sr = [System.IO.StringReader]::New((Get-Content $inputFile -Raw))
     |  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     | Exception calling ".ctor" with "1" argument(s): "Value cannot be null. (Parameter 's')"

I'm having a bit of trouble comprehending the difference between a StringWriter and a StringBuilder: why would I choose to use the StringWriter as I have rather than simply work with a StringBuilder directly? Most importantly, though, the current iteration of Method 2 requires more memory than my system has, and it isn't actually streaming the characters/lines/data from the input file to the output file. Are there built-in methods for checking memory that I'm overlooking, or is there simply a better way to achieve my goal?
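For reference, here is a minimal sketch of my current understanding (an assumption on my part, not tested at scale): a StringWriter appears to be just a TextWriter facade over an internal StringBuilder, and the only memory checks I know of are process-level ones, which may well be what I'm overlooking:

#StringWriter writes land in an underlying StringBuilder
$sb = [System.Text.StringBuilder]::new()
$sw = [System.IO.StringWriter]::new($sb)
$sw.WriteLine('header')
$sb.ToString() -eq $sw.ToString() #True - same underlying buffer

#Crude process-level memory checks
(Get-Process -Id $PID).WorkingSet64 / 1mb #working set in MB
[GC]::GetTotalMemory($false) / 1mb        #managed heap in MB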

Upvotes: 1

Views: 1316

Answers (3)

Gregor y

Reputation: 2050

If you need to go a bit faster on many large files, and you know for certain that your CSV data is clean, you can also drop down to the binary IO.FileStream.

Basically, the example below takes a core sample off the top of the file and scans it for the header and footer lines. It then writes the header, dumps the remainder of the sample after the first record, uses the stream class's CopyTo instead of a PowerShell while loop to get the speed boost, and finally writes the footer.

#assuming that .csv file lines end with CRLF i.e. bytes 13,10
# things go terribly wrong if this is not true
[byte]$ByteCR = 13 # 0D
[byte]$ByteLF = 10 # 0A
function Find-OffsetPastNextEOL{
   param([System.Collections.IEnumerator]$enu)
   $QuotedState = $false #ToDo: csv files can possibly have multiple lines per record
   $CRLF_found  = $false
   $count = 0
   while($enu.MoveNext() -and !$CRLF_found){ #expected to be far fewer iterations than the number of lines in the file
      $count++
      if($enu.Current -eq $ByteCR -and $enu.MoveNext()){
         $count++
         $CRLF_found = $enu.Current -eq $ByteLF
      }
   }
   return $count
}
function Test-EndOfFileHasEOL{
   param([System.IO.FileStream]$read)
   $null = $read.Seek(-2,'End')
   return $read.ReadByte() -eq $ByteCR -and $read.ReadByte() -eq $ByteLF
}

$BufferSize = 100mb
$SampleSize =   1mb #ideally something just big enough to make sure you get the first two lines of every file
$SampleWithHeadAndFoot = new-object byte[] $SampleSize
Foreach ($inputFile in $files){
   try{
#[IO.FileStream]::new(($IWantThis=$FullPath),($InorderTo='Open'),($IWill='Read'),($AtTheSameTimeOthersCan='Read'),($BytesAtOnce=$BufferSize))

      $ReadFilePath  = $inputFile.FullName
      $read  = [IO.FileStream]::new($ReadFilePath ,'Open'  ,'Read' ,'Read',$BufferSize)

      $WriteFilePath = $ReadFilePath -Replace ($inputFile.Name+'$'),"$Prefix`$0"
      $write = [IO.FileStream]::new($WriteFilePath,'Append','Write','None',$BufferSize)

      $TotalBytesSampled = $read.Read($SampleWithHeadAndFoot, 0, $SampleSize)
      #ToDo: check for BOM or other indicators that the csv data is one-byte ASCII or UTF8

      $enu = $SampleWithHeadAndFoot.GetEnumerator()
      $HeaderLength = 0 + (Find-OffsetPastNextEOL $enu)
      $FooterLength = 1 + (Find-OffsetPastNextEOL $enu)
      $DataStartPosition = $HeaderLength + $FooterLength
      $OversampleLength  = $TotalBytesSampled - ($HeaderLength + $FooterLength)

      $write.Write($SampleWithHeadAndFoot,0,$HeaderLength)             #write the header from the sample
      if($DataStartPosition -lt $TotalBytesSampled - 1){               #flush the sample data after the first record
         $write.Write($SampleWithHeadAndFoot,$DataStartPosition,$OversampleLength)
      }
      $read.CopyTo($write,$BufferSize)                                 #flush the rest of the data still in the read stream
      if(!(Test-EndOfFileHasEOL $read)){                               #inject CRLF if EOF didn't already have one
         $write.WriteByte($ByteCR)
         $write.WriteByte($ByteLF)
      }
      $write.Write($SampleWithHeadAndFoot,$HeaderLength,$FooterLength) #write the first record as the footer
   }finally{
      if($read) { $read.Dispose() }  #guard: the FileStream constructor may have thrown
      if($write){ $write.Dispose() }
   }
}

I manually tested with the setup below, but you'll likely want to make a few modifications for production code to make it more robust against CSV data found in the wild (one such tweak is sketched after the hex dumps).

PS> sc xx.csv -Value @"
this is a header`r
this was line 1`r
this was line 2`r
this was line 3`r
this was line 4`r
this was line 5
"@ -Encoding utf8 -NoNewLine

PS> $Prefix     = '1_'
PS> $files      = ls xx.csv
PS> $SampleSize = 40

PS> Format-Hex .\xx.csv

           Path: .\xx.csv

           00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F

00000000   EF BB BF 74 68 69 73 20 69 73 20 61 20 68 65 61  ï»¿this is a hea
00000010   64 65 72 0D 0A 74 68 69 73 20 77 61 73 20 6C 69  der..this was li
00000020   6E 65 20 31 0D 0A 74 68 69 73 20 77 61 73 20 6C  ne 1..this was l
00000030   69 6E 65 20 32 0D 0A 74 68 69 73 20 77 61 73 20  ine 2..this was
00000040   6C 69 6E 65 20 33 0D 0A 74 68 69 73 20 77 61 73  line 3..this was
00000050   20 6C 69 6E 65 20 34 0D 0A 74 68 69 73 20 77 61   line 4..this wa
00000060   73 20 6C 69 6E 65 20 35                          s line 5

PS> Format-Hex .\1_xx.csv

           Path: .\1_xx.csv

           00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F

00000000   EF BB BF 74 68 69 73 20 69 73 20 61 20 68 65 61  ï»¿this is a hea
00000010   64 65 72 0D 0A 74 68 69 73 20 77 61 73 20 6C 69  der..this was li
00000020   6E 65 20 32 0D 0A 74 68 69 73 20 77 61 73 20 6C  ne 2..this was l
00000030   69 6E 65 20 33 0D 0A 74 68 69 73 20 77 61 73 20  ine 3..this was
00000040   6C 69 6E 65 20 34 0D 0A 74 68 69 73 20 77 61 73  line 4..this was
00000050   20 6C 69 6E 65 20 35 0D 0A 74 68 69 73 20 77 61   line 5..this wa
00000060   73 20 6C 69 6E 65 20 31 0D 0A                    s line 1..
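As one example of such a modification (a hypothetical hardening, not part of the tested code above): the -Replace that builds $WriteFilePath treats the file name as a regex, so a name containing metacharacters such as ( or ) could mis-match; escaping the name first keeps it literal:

#hypothetical hardening of the $WriteFilePath line above
$WriteFilePath = $ReadFilePath -Replace ([regex]::Escape($inputFile.Name)+'$'),"$Prefix`$0"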

Upvotes: 0

Santiago Squarzon

Reputation: 60110

This is how your code would look using StreamReader and StreamWriter:

Get-ChildItem -LiteralPath $PWD -Filter '*.csv' -File | ForEach-Object {
    try {
        $path   = "$($_.DirectoryName)\$Prefix$($_.BaseName).csv"
        $writer = [IO.StreamWriter] $path
        $stream = $_.OpenRead()
        $reader = [IO.StreamReader] $stream
        $writer.WriteLine($reader.ReadLine()) # first line is the header, written straight through
        $footer = $reader.ReadLine()          # second line is the first record, held back until the end
        while(-not $reader.EndOfStream) {
            $writer.WriteLine($reader.ReadLine())
        }
        $writer.WriteLine($footer)
    }
    finally {
        $stream, $reader, $writer | ForEach-Object Dispose
    }
}

This method keeps memory usage as low as possible, since only one line is held in memory at a time, and it is about as efficient as line-by-line processing gets.
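For comparison, the same streaming behavior is also available through [IO.File]::ReadLines, which lazily enumerates lines one at a time; a minimal sketch under the same assumptions ($Prefix defined as in the question):

Get-ChildItem -LiteralPath $PWD -Filter '*.csv' -File | ForEach-Object {
    $writer = [IO.StreamWriter] "$($_.DirectoryName)\$Prefix$($_.BaseName).csv"
    try {
        $index = 0
        foreach ($line in [IO.File]::ReadLines($_.FullName)) {
            if     ($index -eq 0) { $writer.WriteLine($line) } # header goes first
            elseif ($index -eq 1) { $footer = $line }          # hold back the first record
            else                  { $writer.WriteLine($line) }
            $index++
        }
        if ($null -ne $footer) { $writer.WriteLine($footer) }  # first record becomes the last line
    }
    finally { $writer.Dispose() }
}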

Upvotes: 0

iRon

Reputation: 23673

The nice thing about the PowerShell pipeline is that it streams by nature, if used correctly, meaning:

  • Do not assign any pipeline results to a variable, and
  • Do not use parentheses,

as either will choke the pipeline.
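For instance, a small illustration of the difference (with a hypothetical file name):

#Streams: lines flow through ForEach-Object one at a time
Get-Content -LiteralPath .\big.csv | ForEach-Object { $_ }

#Chokes: the parentheses force Get-Content to collect every line in memory
#before ForEach-Object sees the first one
(Get-Content -LiteralPath .\big.csv) | ForEach-Object { $_ }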

In your case:

$Prefix = Read-Host -Prompt "What do you want to use as the prefix for the updated file names? (The number 1 is the default)"
If ([string]::IsNullOrEmpty($Prefix)){$Prefix = '1_'}
If($Prefix[-1] -ne '_'){$Prefix = "$($Prefix)_"}

Get-ChildItem -LiteralPath $PWD -Filter '*.csv' -File |ForEach-Object {
    $inputFile = $_ # capture the current file so the Export-Csv path can refer to it
    Import-Csv -LiteralPath $inputFile.FullName |
    ForEach-Object -Begin { $Index = 0 } -Process {
        if ($Index++) { $_ } else { $Footer = $_ } # hold back the first record
    } -End { $Footer } | # emit the held-back record last
    Export-Csv -LiteralPath "$($inputFile.DirectoryName)\$($Prefix)$($inputFile.BaseName).csv" -NoTypeInformation
}

Upvotes: 1
