Reputation: 4736
I need to parse a large pipe-delimited file and count how many records have a 5th column that meets my criteria and how many do not.
PS C:\temp> gc .\items.txt -readcount 1000 | `
? { $_ -notlike "HEAD" } | `
% { foreach ($s in $_) { $s.split("|")[4] } } | `
group -property {$_ -ge 256} -noelement | `
ft -autosize
This command does what I want, returning output like this:
  Count Name
  ----- ----
1129339 True
2013703 False
However, for a 500 MB test file, this command takes about 5.5 minutes to run, as measured by Measure-Command. A typical file is over 2 GB, so a run would take 20+ minutes, which is undesirably long.
Do you see a way to improve the performance of this command?
For example, is there a way to determine an optimal value for Get-Content's ReadCount? Without ReadCount, the same file takes 8.8 minutes.
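One way to explore the ReadCount question empirically is to time the same pipeline at several batch sizes. The following is only a rough sketch: the sample file, its contents, and the candidate values are made up for illustration, and timings on a small generated file may not carry over to a multi-gigabyte one.

```powershell
# Hypothetical sample file; substitute a representative slice of the real data.
$path = Join-Path ([IO.Path]::GetTempPath()) 'readcount-sample.txt'
1..20000 | ForEach-Object { "a|b|c|d|$($_ % 512)" } | Set-Content -Path $path

# Time the same extraction step for each candidate batch size.
# A ReadCount of 0 sends the whole file down the pipeline as one array.
$timings = foreach ($n in 1, 100, 1000, 10000, 0) {
    $elapsed = Measure-Command {
        Get-Content -Path $path -ReadCount $n |
            ForEach-Object { foreach ($s in $_) { $s.Split('|')[4] } } |
            Out-Null
    }
    [pscustomobject]@{
        ReadCount = $n
        Seconds   = [math]::Round($elapsed.TotalSeconds, 2)
    }
}

$timings | Sort-Object Seconds | Format-Table -AutoSize
Remove-Item $path
```

Repeating each measurement a few times and taking the median would give steadier numbers, since a single Measure-Command run can be skewed by disk caching.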
Upvotes: 6
Views: 2449
Reputation: 52689
Just adding another example that uses StreamReader to read through a very large IIS log file and output all unique client IP addresses, along with some perf metrics.
$path = 'A_245MB_IIS_Log_File.txt'
$r = [IO.File]::OpenText($path)
# Hashtable keys act as the set of unique client IPs.
$clients = @{}
while ($r.Peek() -ge 0) {
    $line = $r.ReadLine()
    # String processing here...
    # Skip lines that start with '#'.
    if (-not $line.StartsWith('#')) {
        $split = $line.Split()
        # The client IP is the fifth field from the end.
        $client = $split[-5]
        if (-not $clients.ContainsKey($client)) {
            $clients.Add($client, $null)
        }
    }
}
$r.Dispose()
$clients.Keys | Sort
A little performance comparison against Get-Content:
StreamReader: Completed: 5.5 seconds. PowerShell.exe: 35,328 KB RAM.
Get-Content: Completed: 23.6 seconds. PowerShell.exe: 1,110,524 KB RAM.
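For anyone who wants to try a similar comparison, here is a minimal sketch. It is not how the numbers above were collected. `Measure-Run` is a hypothetical helper, the sample file is generated on the spot, and working-set growth is only a coarse stand-in for the RAM figures quoted.

```powershell
# Hypothetical helper: times a script block and reports working-set growth.
function Measure-Run {
    param(
        [string]$Name,
        [scriptblock]$Action
    )
    [GC]::Collect()                                   # reduce noise from earlier runs
    $before  = (Get-Process -Id $PID).WorkingSet64
    $elapsed = Measure-Command -Expression $Action
    $after   = (Get-Process -Id $PID).WorkingSet64
    [pscustomobject]@{
        Name       = $Name
        Seconds    = [math]::Round($elapsed.TotalSeconds, 2)
        GrowthInKB = [math]::Round(($after - $before) / 1KB)
    }
}

# Small generated sample so the sketch runs anywhere.
$path = Join-Path ([IO.Path]::GetTempPath()) 'measure-sample.txt'
1..20000 | ForEach-Object { "line $_" } | Set-Content -Path $path

$results = @(
    Measure-Run 'StreamReader' {
        $r = [IO.File]::OpenText($path)
        try { while ($null -ne $r.ReadLine()) { } } finally { $r.Dispose() }
    }
    Measure-Run 'Get-Content' {
        foreach ($line in Get-Content -Path $path) { }
    }
)

$results | Format-Table -AutoSize
Remove-Item $path
```

Running each variant in a fresh PowerShell process would give cleaner memory numbers, because the second measurement here inherits whatever the first one left allocated.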
Upvotes: 2
Reputation: 4736
Using @Gisli's hint, here's the script I ended up with:
param($file = $(Read-Host -prompt "File"))
# Resolve to a full path: .NET does not follow PowerShell's current location.
$fullName = (Get-Item "$file").FullName
$sr = New-Object System.IO.StreamReader("$fullName")
$trueCount = 0
$falseCount = 0
while (($line = $sr.ReadLine()) -ne $null) {
    # Skip the header record; -like needs the trailing * to match a prefix.
    if ($line -like 'HEAD|*') { continue }
    if ($line.split("|")[4] -ge 256) {
        $trueCount++
    }
    else {
        $falseCount++
    }
}
$sr.Dispose()
write "True count: $trueCount"
write "False count: $falseCount"
It yields the same results in about a minute, which meets my performance requirements.
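One caveat worth checking against real data: `Split` returns strings, and PowerShell's comparison operators convert the right-hand operand to the type of the left-hand one, so `$line.split("|")[4] -ge 256` compares text rather than numbers. The sketch below assumes the 5th column holds plain integers of varying width; the sample line is made up for illustration. If the column is fixed-width or zero-padded, the two comparisons may agree anyway.

```powershell
# Hypothetical record whose 5th column is the integer 1000.
$line  = 'a|b|c|d|1000'
$field = $line.Split('|')[4]

$field -ge 256         # False: compared as strings, "1000" sorts before "256"
[int]$field -ge 256    # True: compared as numbers
```

Casting with `[int]` gives a numeric comparison, but it would also change the counts relative to the original Get-Content pipeline, which compares strings in the same way.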
Upvotes: 4
Reputation: 744
Have you tried StreamReader? I think that Get-Content loads the whole file into memory before it does anything with it.
Upvotes: 4