Ankit

Reputation: 1887

Preserve Line Feeds and Carriage Returns in PowerShell

I have a CSV file that originated on a Linux server, so its lines end with \n. I process this file through a PowerShell script and replace its text qualifier with something else (I do this because I'm using SSIS to upload the CSV feed to a database, and for some odd reason SSIS doesn't support embedded text qualifiers).

The part of the script that does this replacement looks like this:

gc $file.FullName |
    % { if($_.indexOf("|~|") -eq -1) {$_ -replace "`"((?:`"`"|.)*?)`"(?!`")", "|~|`$1|~|" -replace "`"`"", "`""} else {$_ -replace " ", " "}} |
    sc  $temppath

This script works fine, but it also changes the line feed at the end of each line to \r\n. I thought that wouldn't be a big problem, until I realized that my original feed also contains occasional \r characters in the description column, which are likewise getting replaced with \r\n. Now the SSIS package is unable to identify where each CSV line ends.

I searched and found that this is due to Get-Content, which works line by line, so I changed the command to the following:

[System.IO.File]::ReadAllText($file.FullName) |
            % { if($_.indexOf("|~|") -eq -1) {$_ -replace "`"((?:`"`"|.)*?)`"(?!`")", "|~|`$1|~|" -replace "`"`"", "`""} else {$_ -replace " ", " "}} |
            sc  $temppath

That seems to solve my issue, but now I'm stuck with an OutOfMemoryException, as some of the CSV files are big (about 400-500 MB).

Any suggestions on what I can possibly do? Perhaps some replacement for ReadAllText() that works for big files?

Upvotes: 0

Views: 3644

Answers (1)

Adi Inbar

Reputation: 12323

The problem is that by piping the output of Get-Content, you're splitting the file into individual lines, and Set-Content then joins those lines back together into a new file. Since CR/LF is how lines are separated in Windows, that's what PowerShell cmdlets use when combining lines into a file (you get the same behavior with Out-File, not surprisingly). But you already knew that. Now, what's the solution?
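To see this behavior in isolation, here is a minimal demonstration (file names are illustrative; note that .NET methods resolve relative paths against the process working directory, not the PowerShell location, hence the Join-Path):

```powershell
$in  = Join-Path $pwd 'in.txt'
$out = Join-Path $pwd 'out.txt'

# Write a file with LF-only line endings, as a Linux-originated file would have:
[System.IO.File]::WriteAllText($in, "a`nb`n")

# Round-trip it through the pipeline; Set-Content rejoins the lines with CRLF:
Get-Content $in | Set-Content $out

[System.IO.File]::ReadAllText($out)   # "a`r`nb`r`n" -- every LF became CRLF
```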

One way is to use the -join operator to join all the lines into a single string consisting of the lines separated by LF characters, and pipe that string to Set-Content:

(Get-Content $file | %{
  if ($_.indexOf("|~|") -eq -1) {
     $_ -replace "`"((?:`"`"|.)*?)`"(?!`")", "|~|`$1|~|" -replace "`"`"", "`""
  } else {
    $_ -replace " ", " "
  }
}) -join "`n" | Set-Content  $temppath

Although this joins all the lines into a single string that has to be held in memory, I strongly suspect this will work better for you: with 500 MB files, the OutOfMemoryException is unlikely to be a limitation of system resources, so it's probably a limitation of the .NET class.

However, if that still gives you memory errors, or it works but is too taxing on system resources, you can add the lines to the file one at a time, in each iteration of the pipeline, by using System.IO.File's AppendAllText method, which appends each line without adding a newline (as piping to Out-File -Append would), and tacking a "`n" onto each line yourself:

Get-Content $file | %{
  [System.IO.File]::AppendAllText($temppath, (
    $(if ($_.indexOf("|~|") -eq -1) {
      $_ -replace "`"((?:`"`"|.)*?)`"(?!`")", "|~|`$1|~|" -replace "`"`"", "`""
    } else {
      $_ -replace " ", " "
    }) + "`n"
  ))
}

This would be slower, but drastically less memory-intensive.
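If the repeated file opens from AppendAllText turn out to be too slow, one possible middle ground (a sketch, not part of the original answer, using the same $file and $temppath variables) is to keep a single StreamWriter open and write each transformed line followed by an explicit LF:

```powershell
# Sketch: stream line by line with one open writer.
# StreamWriter also needs a full path, and $temppath is assumed to be one.
$writer = New-Object System.IO.StreamWriter($temppath)
try {
  Get-Content $file | ForEach-Object {
    $line = if ($_.IndexOf("|~|") -eq -1) {
      $_ -replace "`"((?:`"`"|.)*?)`"(?!`")", "|~|`$1|~|" -replace "`"`"", "`""
    } else {
      $_
    }
    $writer.Write($line + "`n")   # LF only, never CRLF
  }
} finally {
  $writer.Dispose()   # flush and close even if the pipeline throws
}
```

This keeps memory usage at one line at a time, like the AppendAllText version, while opening the output file only once.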

Note, BTW, that gc $file.FullName is redundant, because a FileInfo object is implicitly converted to a string as its FullName property, so gc $file is sufficient.

Upvotes: 2
