Reputation: 81
I want to find a piece of text in a large XML file and replace it with some other text. The file is around 50GB, and I want to do this from the command line. I am looking at PowerShell and want to know if it can handle a file of that size.
Currently I am trying something like this, but PowerShell does not like it:
Get-Content C:\File1.xml | Foreach-Object {$_ -replace "xmlns:xsi=\"http:\/\/www\.w3\.org\/2001\/XMLSchema-instance\"", ""} | Set-Content C:\File1.xml
The text I want to replace is xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance", and I want to replace it with an empty string "".
Thanks
Upvotes: 8
Views: 19380
Reputation: 6496
Here is a variation of the answer from @Digital_Coyote that adds a buffer. During testing, using the buffer increased RAM usage to ~700MB at times (this will vary with the size of each row in your file), but it sped up processing significantly when tested against a 500MB file. You can adjust the $flushCnt value, which is the number of chunks to buffer before writing them to the file; since Get-Content is using -ReadCount 1000 below, each chunk is up to 1000 rows. The smaller $flushCnt is, the less RAM the process will use.
As mentioned by others, piping directly to Set-Content used a huge amount of RAM for very large files, so that is why I'm using a buffer instead.
$sourceFile = 'J:\BacPacs\model.raw.xml'
$destinationFile = 'J:\BacPacs\model.replaced.xml'
$flushCnt = 1000;
$searchForString = '<Property Name="AutoDrop" Value="True" />';
$replaceWithString = '';
$buffer = @();
$buffCnt = 0;

# -ReadCount 1000 delivers lines in chunks, so $_ is an array of up to
# 1000 lines and $buffCnt counts chunks, not individual rows
Get-Content -LiteralPath $sourceFile -ReadCount 1000 | %{
    $buffer += $_.Replace($searchForString,$replaceWithString);
    $buffCnt++;
    if($buffCnt -ge $flushCnt)
    {
        $buffer | Add-Content -LiteralPath $destinationFile
        $buffer = @();
        $buffCnt = 0;
    }
}

# flush anything still remaining in the buffer
if($buffCnt -gt 0)
{
    $buffer | Add-Content -LiteralPath $destinationFile
    $buffer = @();
    $buffCnt = 0;
}
Upvotes: 0
Reputation: 341
Aside from worrying about reading the file in chunks to avoid loading it into memory, you need to dump to disk often enough that you aren't storing the entire contents of the resulting file in memory.
Get-Content sourcefile.txt -ReadCount 10000 |
    Foreach-Object {
        $line = $_.Replace('http://example.com', 'http://another.example.com')
        Add-Content -Path result.txt -Value $line
    }
The -ReadCount <number> parameter sets the number of lines to read at a time. The ForEach-Object then writes each chunk out as it is read. For a 30GB file filled with SQL inserts, I topped out around 200MB of memory and 8% CPU, whereas piping it all into Set-Content hit 3GB of memory before I killed it.
Upvotes: 14
Reputation: 5984
This is my take on it, building on some of the other answers here:
Function ReplaceTextIn-File{
    Param(
        $infile,
        $outfile,
        $find,
        $replace
    )

    if( -Not $outfile)
    {
        $outfile = $infile
    }

    $temp_out_file = "$outfile.temp"

    Get-Content $infile | Foreach-Object {$_.Replace($find, $replace)} | Set-Content $temp_out_file

    if( Test-Path $outfile)
    {
        Remove-Item $outfile
    }
    Move-Item $temp_out_file $outfile
}
And called like so:
ReplaceTextIn-File -infile "c:\input.txt" -find 'http://example.com' -replace 'http://another.example.com'
Upvotes: 1
Reputation: 1291
I had a similar need (and similar lack of powershell experience) but cobbled together a complete answer from the other answers on this page plus a bit more research.
I also wanted to avoid the regex processing, since I didn't need it either -- just a simple string replace -- but on a large file, so I didn't want it loaded into memory.
Here's the command I used (adding linebreaks for readability):
Get-Content sourcefile.txt |
    Foreach-Object {$_.Replace('http://example.com', 'http://another.example.com')} |
    Set-Content result.txt
Worked perfectly! Never sucked up much memory (it very obviously didn't load the whole file into memory), and just chugged along for a few minutes then finished.
Upvotes: 15
Reputation: 201652
It does not like it because you can't read from a file and write back to it at the same time using Get-Content/Set-Content. I recommend using a temp file and then at the end, rename file1.xml to file1.xml.bak and rename the temp file to file1.xml.
To run a script from the current directory, use .\myscript.ps1, and if it takes parameters: c:\users\joe\myscript.ps1 c:\temp\file1.xml
In general, for regexes I would use single quotes if you don't need to reference PowerShell variables. Then you only need to worry about regex escaping and not PowerShell escaping as well. If you need to use double quotes, the backtick character is the escape char inside them, e.g. "`$p1 is set to $p1". In your example, single quoting simplifies your regex to the following (note: forward slashes aren't metacharacters in regex):
'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"'
Absolutely you want to stream this since 50GB won't fit into memory. However, this poses an issue if you process line-by-line. What if the text you want to replace is split across multiple lines?
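A minimal sketch of the temp-file approach described above, using the paths and string from the question (the .tmp and .bak names are just assumptions; adjust to taste):

```powershell
# Stream the file line by line, writing the result to a temp file
# rather than back to the file being read
Get-Content C:\File1.xml |
    ForEach-Object { $_ -replace 'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"', '' } |
    Set-Content C:\File1.tmp

# Keep the original as a backup, then swap the temp file into place
Rename-Item C:\File1.xml File1.xml.bak
Rename-Item C:\File1.tmp File1.xml
```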
Upvotes: 4
Reputation: 95
The escape character in PowerShell strings is the backtick ( ` ), not backslash ( \ ).
The only thing you should have to escape is the quotes; the periods and such should be fine without.
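For instance, the string from the question can be written either way; single quotes avoid the escaping entirely:

```powershell
# Double-quoted: embedded double quotes must be escaped with a backtick
$find = "xmlns:xsi=`"http://www.w3.org/2001/XMLSchema-instance`""

# Single-quoted: no escaping needed for double quotes inside
$find2 = 'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"'
```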
Upvotes: -3