Reputation: 53
I am doing some file clean up before loading into my data warehouse and have run into a file sizing issue:
(Get-Content -path C:\Workspace\workfile\myfile.txt -Raw) -replace '\\"', '"' | Set-Content C:\Workspace\workfile\myfileCLEAN.txt
My file is about 2GB. I am receiving the following error and not sure how to correct.
Get-Content : Exception of type 'System.OutOfMemoryException' was thrown, ........
I am NOT a coder, but I do like learning so am building my own data warehouse. So if you do respond, keep my experience level in mind :)
Upvotes: 5
Views: 2874
Reputation: 439257
A performant way of reading a text file line by line - without loading the entire file into memory - is to use a switch
statement with the -File
parameter.
A performant way of writing a text file is to use a System.IO.StreamWriter
instance.
As Mathias points out in his answer, using verbatim \"
with the regex-based -replace
operator actually replaces "
alone, due to the escaping rules of regexes. While you could address that with '\\"'
, in this case a simpler and better-performing alternative is to use the [string]
type's Replace()
method, which operates on literal substrings.
To put it all together:
# Note: Be sure to use a *full* path, because .NET's working dir. usually
# differs from PowerShell's.
$streamWriter = [System.IO.StreamWriter]::new('C:\Workspace\workfile\myfileCLEAN.txt')
switch -File C:\Workspace\workfile\myfile.txt {
default { $streamWriter.WriteLine($_.Replace('\"', '"')) }
}
$streamWriter.Close()
Note: If you're using an old version of Windows PowerShell, namely version 4 or below, use
New-Object System.IO.StreamWriter 'C:\Workspace\workfile\myfileCLEAN.txt'
instead of
[System.IO.StreamWriter]::new('C:\Workspace\workfile\myfileCLEAN.txt')
Upvotes: 2
Reputation: 1569
Get-Content
loads the whole file into memory.
Try processing line by line to improve memory utilization.
$infile = "C:\Workspace\workfile\myfile.txt"
$outfile = "C:\Workspace\workfile\myfileCLEAN.txt"
foreach ($line in [System.IO.File]::ReadLines($infile)) {
Add-Content -Path $outfile -Value ($line -replace '\\"','"')
}
Upvotes: 0
Reputation: 174690
Get-Content -Raw
makes PowerShell read the entire file into a single string.
.NET can't store individual objects over 2GB in size in memory, and each character in a string takes up 2 bytes, so after reading the first ~1 billion characters (roughly equivalent to a 1GB ASCII-encoded text file), it reaches the memory limit.
Remove the -Raw
switch, -replace
is perfectly capable of operating on multiple input strings at once:
(Get-Content -path C:\Workspace\workfile\myfile.txt) -replace '\"', '"' | Set-Content C:\Workspace\workfile\myfileCLEAN.txt
Beware that -replace
is a regex operator, and if you want to remove \
from a string, you need to escape it:
(Get-Content -path C:\Workspace\workfile\myfile.txt) -replace '\\"', '"' | Set-Content C:\Workspace\workfile\myfileCLEAN.txt
While this will work, it'll still be slow due to the fact that we're still loading >2GB of data into memory before applying -replace
and writing to the output file.
Instead, you might want to pipe the output from Get-Content
to the ForEach-Object
cmdlet:
Get-Content -path C:\Workspace\workfile\myfile.txt |ForEach-Object {
$_ -replace '\\"','"'
} |Set-Content C:\Workspace\workfile\myfileCLEAN.txt
This allows Get-Content
to start pushing output prior to finishing reading the file, and PowerShell therefore no longer needs to allocate as much memory as before, resulting in faster execution.
Upvotes: 2